2017-09-11: PPO service outage

On Monday, 11 September, a change to our Single Sign-On (SSO) functionality was deployed to production. It contained a bug that prevented unauthenticated SSO users from signing in. Below are the details of the incident, its root cause and our mitigating actions.

At 13:58 (UTC+2), new SSO features were deployed to our application servers. Almost immediately we started receiving notifications of errors occurring during the SSO sign-in process. It was clear that the latest change set contained a bug, and we started the rollback procedure.

By 14:30, the rollback was complete and SSO users were once again able to log in. Non-SSO users, as well as SSO users who were already authenticated, were unaffected.

Our investigation found that an upgrade to a third-party SAML library we use had introduced a breaking change. Not anticipating this, we failed to identify the full scope of testing required for the intended change.

It was clear to us that we had broken a fundamental rule of incremental change. The updated third-party library should have been introduced and tested on its own, before the new SSO feature was implemented. In future, we will strive to keep our changes as small as possible. Our rollback procedure also needs improvement, since the rollback took longer than expected and required manual intervention.

We would like to apologize for any inconvenience that resulted from this incident.

Author: Jimmy Hekma

Jimmy is one of the founders of Project Portfolio Office. He is the company's Chief Technology Officer and heads up the Product Management team.