Frank Galos
Nov 12, 2023

Postmortem: Web Stack Outage on November 7, 2023 — Blunders in the Code

Duration: November 7, 2023, 09:00 AM — 12:30 PM (UTC)

Impact: Users endured a 30% increase in response times across all services, leading to a degraded user experience.

Root Cause: A suboptimal database query in the code used resources inefficiently and exhausted the database connection pool.

Timeline:

09:00 AM: Issue detected through monitoring alerts indicating a spike in response times.

09:15 AM: DevOps team initiated investigation, suspecting potential code inefficiencies.

09:30 AM: Initial hypothesis was that a poorly optimised algorithm was slowing response times.

10:00 AM: Codebase analysed for performance bottlenecks, revealing a suboptimal database query.

10:30 AM: Database administrators were brought in to assess the impact of the inefficient query on connection pool usage.

11:00 AM: The team confirmed that the inefficient query was rapidly exhausting the database connection pool.

11:30 AM: Incident escalated to development teams; code refactoring initiated.

12:00 PM: Refactored code deployed, and database connection pool parameters adjusted.

12:30 PM: Response times normalised, and services restored to full capacity.

Root Cause and Resolution:

Root Cause: A poorly optimised database query in the code consumed excessive resources and exhausted the database connection pool.

Resolution: The code was refactored to optimise the database query, and connection pool parameters were adjusted to prevent future exhaustion.
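
To see why a slow query can drain a pool, a rough back-of-the-envelope sketch using Little's law (average connections busy = request rate × time each request holds a connection) may help. The pool size, request rate, and query durations below are hypothetical numbers chosen for illustration; they are not measurements from this incident.

```python
import math

def connections_in_use(requests_per_second: int, hold_ms: int) -> int:
    """Average busy connections at steady state (Little's law: L = rate * hold time)."""
    return math.ceil(requests_per_second * hold_ms / 1000)

# Hypothetical values, not taken from the incident.
POOL_SIZE = 20       # maximum connections the pool hands out
REQUEST_RATE = 100   # requests per second, one query per request

for label, hold_ms in [("optimised query", 50), ("slow query", 500)]:
    busy = connections_in_use(REQUEST_RATE, hold_ms)
    status = "pool exhausted" if busy > POOL_SIZE else "ok"
    print(f"{label}: ~{busy} of {POOL_SIZE} connections busy ({status})")
```

Under these assumed numbers, a tenfold increase in query time raises the demand for connections tenfold. This is why shortening the query and resizing the pool address the same problem from two sides.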

Corrective and Preventative Measures:

Things to Improve/Fix:

Code Reviews: Strengthen code review processes to catch and rectify inefficient queries during development.

Performance Testing: Integrate comprehensive performance testing into the development pipeline to identify potential bottlenecks early on.
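
One possible shape for such a pipeline check is a latency budget test: run a query repeatedly against a seeded test database and fail the build if the 95th percentile exceeds a threshold. The sketch below uses an in-memory SQLite database as a stand-in, since the post does not name the production database. The table, helper names, and the budget value are all assumptions.

```python
import sqlite3
import time

def build_test_db(rows: int = 5000) -> sqlite3.Connection:
    """Create an in-memory database seeded with hypothetical order rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 100, float(i)) for i in range(rows)],
    )
    conn.commit()
    return conn

def p95_latency_ms(conn, sql, params=(), runs: int = 50) -> float:
    """Run the query `runs` times and return the 95th percentile latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]

def within_budget(conn, sql, params=(), budget_ms: float = 100.0) -> bool:
    """True if the query's p95 latency stays at or under the budget."""
    return p95_latency_ms(conn, sql, params) <= budget_ms

if __name__ == "__main__":
    db = build_test_db()
    query = "SELECT * FROM orders WHERE customer_id = ?"
    print("within budget:", within_budget(db, query, (7,)))
```

Timing checks like this are noisy on shared CI runners, so the budget usually needs generous headroom and a realistic data volume to be meaningful.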

Tasks to Address the Issue:

Code Refactoring: Conduct a thorough review of the codebase to identify and refactor any other suboptimal queries.

Developer Training: Organise training sessions for developers to enhance their awareness of coding practices that impact system performance.

Automated Code Analysis: Implement automated tools to analyse code for performance issues as part of the continuous integration process.
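
As one illustration of an automated check, a CI step could ask the database for each query's execution plan and flag queries that scan a whole table. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in; the production database is not named in this post, and other engines expose plans differently. The schema and the `full_scans` helper are hypothetical.

```python
import sqlite3

def full_scans(conn: sqlite3.Connection, sql: str, params=()) -> list:
    """Return the plan steps in which SQLite scans a whole table or index."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a text description that starts
    # with SCAN (full pass) or SEARCH (index or key lookup).
    return [row[3] for row in plan if row[3].startswith("SCAN")]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    query = "SELECT * FROM orders WHERE customer_id = ?"

    # Without an index on customer_id, the filter needs a full table scan.
    print("before index:", full_scans(conn, query, (1,)))

    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    # With the index, the plan should use SEARCH and the list should be empty.
    print("after index:", full_scans(conn, query, (1,)))
```

A check like this only catches one class of problem (missing indexes on filtered columns), so it complements code review and performance testing but does not replace them.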
