2016-10-07: PPO Service Outage

On Friday, 7 October, a bug was introduced into our production environment that prevented some users from authenticating and required us to take the service offline to perform a rollback. Below is a detailed explanation of the events surrounding the incident, our investigation into the root cause, and the actions we plan to take to prevent a similar occurrence in future.

At 12:39 (UTC+0), a routine deployment to our application servers was started. Reports of users having difficulty logging in started at around 13:05. We also saw a number of IP addresses being blocked on our servers because their authentication failures exceeded our thresholds.

A quick investigation showed intermittent behaviour, but it confirmed that a bug was present. As a precautionary measure, the service was taken offline at 13:16 and a rollback was performed. Full service was restored at 13:35.

A more detailed investigation found that users who had an active session at the time of the deployment were unaffected by the bug. Users who logged on after the deployment, however, were unable to authenticate.

The bug was introduced while refactoring a caching pattern that affected user objects in our data access layer. The change was made as part of a distributed cache that we are currently implementing.

In most practical cases, testing is performed by one tester at a time, unless otherwise specified. In this particular case, testing was performed by multiple testers, but not concurrently. The bug, however, only surfaced in a multi-user environment on a single instance of PPO. Unfortunately, the particular circumstances under which the bug occurred were not considered to be affected by the change and, as a result, were not tested.
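As an illustration of the kind of automated check that exercises several users at once against a single instance, here is a minimal sketch in Python. It is not PPO's code and does not reproduce the actual defect, whose details are not described here. The names UserCache, load_user, authenticate and run_concurrent_logins are hypothetical, and the in-memory dictionary is only a stand-in for a real user cache and data access layer.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class UserCache:
    """Hypothetical stand-in for a per-instance cache of user objects."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._items = {}

    def get(self, username):
        # Each user object is stored under its own key, so one user's
        # entry can never be returned for another user.
        with self._lock:
            if username not in self._items:
                self._items[username] = self._loader(username)
            return self._items[username]


def load_user(username):
    # Placeholder for a data access layer lookup (assumption).
    return {"username": username, "password": "pw-" + username}


def authenticate(cache, username, password):
    user = cache.get(username)
    return user["username"] == username and user["password"] == password


def run_concurrent_logins(user_count=20, rounds=5):
    """Log many distinct users in at the same time against one shared cache."""
    cache = UserCache(load_user)
    names = ["user%d" % i for i in range(user_count)] * rounds
    with ThreadPoolExecutor(max_workers=user_count) as pool:
        return list(
            pool.map(lambda n: authenticate(cache, n, "pw-" + n), names)
        )


if __name__ == "__main__":
    results = run_concurrent_logins()
    # Every login should succeed; any False points at cross-user interference.
    print(all(results), len(results))
```

A test of this shape differs from sequential manual testing in that all logins share one cache instance and overlap in time, which is the condition under which the bug surfaced.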

As a preventative measure, we will improve our processes for identifying and dealing with security-related changes. We will pay more attention to automated testing in this particular area. We also felt that the rollback procedure could have been performed faster, and we will make improvements in this regard.

We would like to apologize for any inconvenience that resulted from this incident.

Author: Jimmy Hekma

Jimmy is one of the founders of Project Portfolio Office. He is the company's Chief Technology Officer and heads up the Product Management team.