Application Server Failure - Project Portfolio Office

On 18 September 2015, between 14:08 and 15:09 (UTC+0), approximately 10% of page requests served by one of our application servers failed. These failures affected 14 customers who made requests to that application server during this period. No client data is stored on our application servers, and no data was lost or otherwise impacted as a result of this incident.

Root cause

An unexpected increase in certain temporary files resulted in an out-of-disk-space condition on the root volume of the application server. Our availability monitoring system failed to detect the problem because the majority of requests were still being served successfully. Although our secondary monitoring processes detected the errors, they were not flagged as critical and were not escalated to the appropriate technical staff in a timely manner.
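To illustrate why a check that only asks whether most requests succeed can miss a partial failure, the following minimal sketch classifies a window of requests by its failure rate. It is an illustration only and not the monitoring system described above; the function name and the 1% and 5% thresholds are assumptions chosen for the example.

```python
def classify_error_rate(failed, total, warn_at=0.01, critical_at=0.05):
    """Classify a window of requests by the fraction that failed.

    The thresholds are hypothetical values chosen for illustration.
    """
    if total <= 0:
        # No traffic in the window, so there is nothing to flag.
        return "ok"
    rate = failed / total
    if rate >= critical_at:
        return "critical"
    if rate >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # A window in which 10 of 100 requests failed.
    print(classify_error_rate(10, 100))
```

With these assumed thresholds, a failure rate of about 10% would be classified as critical even though most requests still succeed.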

Remediation

The specific circumstances that led to the out-of-disk-space condition will be addressed as part of a larger architectural change that will be implemented early in October 2015. We will also implement additional monitoring and escalation processes that would have detected the low disk space condition and classified it as a critical error. In addition, we will review the specific circumstances that delayed the response from the technical staff, to ensure a more timely response in future.
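As a rough sketch of the kind of low disk space check described above, the example below reads the free space on a volume with Python's standard shutil.disk_usage and classifies it. It does not represent the monitoring that will actually be deployed; the function names and the 20% and 10% thresholds are assumptions made for illustration.

```python
import shutil


def classify_free_space(free_bytes, total_bytes,
                        warn_below=0.20, critical_below=0.10):
    """Classify a volume by its fraction of free space.

    The thresholds are hypothetical values chosen for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"


def check_volume(path="/"):
    """Read disk usage for the volume containing path and classify it."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)


if __name__ == "__main__":
    print(check_volume("/"))
```

A check like this could run on a schedule, with a "critical" result escalated immediately to technical staff rather than only being logged.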

We would like to apologise to all our customers who were affected and assure you that we will do everything in our power to ensure that a similar outage does not recur.