Postmortem: Web Stack Outage on November 7, 2023 — Blunders in the Code
Duration: November 7, 2023, 09:00 AM — 12:30 PM (UTC)
Impact: Response times rose by 30% across all services, degrading the user experience.
Root Cause: A suboptimal database query used resources inefficiently and exhausted the database connection pool.
Timeline:
09:00 AM: Issue detected through monitoring alerts indicating a spike in response times.
09:15 AM: DevOps team initiated investigation, suspecting potential code inefficiencies.
09:30 AM: Initial hypothesis was that a poorly optimised algorithm was slowing responses.
10:00 AM: Codebase analysed for performance bottlenecks, revealing a suboptimal database query.
10:30 AM: Database administrators involved to assess the impact of the inefficient query on connection pool usage.
11:00 AM: Confirmed that the inefficient query was rapidly exhausting the database connection pool.
11:30 AM: Incident escalated to development teams; code refactoring initiated.
12:00 PM: Refactored code deployed, and database connection pool parameters adjusted.
12:30 PM: Response times normalised, and services restored to full capacity.
Root Cause and Resolution:
Root Cause: A poorly optimised database query in the code held database connections longer than necessary and consumed excessive resources, which exhausted the connection pool.
Resolution: The code was refactored to optimise the database query, and connection pool parameters were adjusted to prevent future exhaustion.
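To make the failure mode concrete, here is a minimal sketch of how slow queries can drain a bounded connection pool. It is a toy model, not the production stack: the BoundedPool class, the run_requests helper, and all timings are hypothetical, and time.sleep stands in for the time a query holds its connection.

```python
import queue
import threading
import time
from contextlib import contextmanager


class BoundedPool:
    """Toy pool (hypothetical): hands out at most `size` connections at once."""

    def __init__(self, size, acquire_timeout):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(f"conn-{i}")
        self._acquire_timeout = acquire_timeout

    @contextmanager
    def connection(self):
        # Raises queue.Empty if no connection is returned within the timeout.
        conn = self._free.get(timeout=self._acquire_timeout)
        try:
            yield conn
        finally:
            self._free.put(conn)


def run_requests(pool, n_requests, query_seconds):
    """Run n_requests concurrently; return how many failed to get a connection."""
    failures = []
    lock = threading.Lock()

    def handle():
        try:
            with pool.connection():
                time.sleep(query_seconds)  # stand-in for the query's run time
        except queue.Empty:
            with lock:
                failures.append(1)

    threads = [threading.Thread(target=handle) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)
```

With a pool of 2 connections and a 0.2 second acquire timeout, six concurrent requests that each hold a connection for 0.5 seconds leave four requests without a connection, while the same load with 0.01 second queries completes without failures. Raising the pool size only delays exhaustion if the query stays slow, which is presumably why the fix combined a query refactor with the parameter change.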
Corrective and Preventative Measures:
Things to Improve/Fix:
Code Reviews: Strengthen code review processes to catch and rectify inefficient queries during development.
Performance Testing: Integrate comprehensive performance testing into the development pipeline to identify potential bottlenecks early on.
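One way such a performance gate might look is a 95th-percentile latency comparison against a recorded baseline. This is only a sketch under assumptions: the function names, the 10% tolerance, and the idea of storing baseline samples are illustrative choices, not part of the existing pipeline.

```python
import statistics


def p95(samples_ms):
    """95th percentile of latency samples, in milliseconds."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples_ms, n=20)[-1]


def exceeds_budget(baseline_ms, current_ms, tolerance=0.10):
    """True if current p95 latency is more than `tolerance` above baseline p95."""
    return p95(current_ms) > p95(baseline_ms) * (1 + tolerance)
```

A pipeline step could fail the build when exceeds_budget returns True. Under these assumed settings, a 30% rise like the one in this incident would be flagged, while a 5% rise would pass.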
Tasks to Address the Issue:
Code Refactoring: Conduct a thorough review of the codebase to identify and refactor any other suboptimal queries.
Developer Training: Organise training sessions for developers to enhance their awareness of coding practices that impact system performance.
Automated Code Analysis: Implement automated tools to analyse code for performance issues as part of the continuous integration process.
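As an illustration of what an automated check could catch, the sketch below counts the SELECT statements a function issues and compares a query-per-row pattern with a single aggregated query. The postmortem does not say what the offending query looked like, so the users/orders schema, the data, and both functions are hypothetical; the example uses Python's built-in sqlite3 module and its set_trace_callback hook rather than the production database.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
        INSERT INTO users VALUES (1, 'a'), (2, 'b'), (3, 'c');
        INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
    """)
    return conn


def totals_query_per_user(conn):
    """Inefficient pattern: one query for the users, then one more per user."""
    totals = {}
    for (user_id,) in conn.execute("SELECT id FROM users").fetchall():
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        totals[user_id] = row[0]
    return totals


def totals_single_query(conn):
    """Same result from a single aggregated query."""
    rows = conn.execute("""
        SELECT u.id, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """)
    return dict(rows.fetchall())


def count_selects(conn, fn):
    """Run fn(conn) and return (result, number of SELECT statements executed)."""
    seen = []
    conn.set_trace_callback(seen.append)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    n = sum(1 for sql in seen if sql.lstrip().upper().startswith("SELECT"))
    return result, n
```

A continuous integration test could assert an upper bound on the count returned by count_selects for each request handler, so a change that makes the number of queries grow with the number of rows fails the build before it reaches production.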