RCA: Signhost Postback Service Slowdown - July 11th, 2024
Date: July 12th, 2024
Summary: On July 11th, 2024, Signhost's postback service slowed down starting at approximately 10:30 CEST. The slowdown caused timeouts and database locks, which prevented postbacks from being sent. Rebooting the postback system at 12:45 CEST allowed it to work through the backlog and resume real-time postback sending for most customers. A remaining problem queue was then identified and purged, and customer postback services were fully restored after statuses were updated via GET calls.
Root Cause: The RCA identified the following root cause for the postback service slowdown:
- Queue Overload: The postback system encountered an overload of tasks, leading to timeouts and database locks. This overload could be due to a surge in postback activity or inefficiencies in the current queueing mechanism.
Contributing Factors:
- Limited Logging: The current logging system does not capture internal postback queue metrics effectively, making it difficult to proactively identify queue overload situations.
- Queueing Mechanism Limitations: The current queueing mechanism might not be sufficient for handling peak loads or may not be optimized for efficient processing of postbacks.
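The logging gap described above could be narrowed by periodically recording a small set of queue metrics. The following is a minimal sketch, not the Signhost implementation: the logger name, the `QueueSnapshot` fields, and the assumption that pending postbacks carry an enqueue timestamp are all hypothetical.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Optional, Sequence

logger = logging.getLogger("postback.queue")  # hypothetical logger name


@dataclass
class QueueSnapshot:
    depth: int                 # number of postbacks waiting to be sent
    oldest_age_seconds: float  # how long the oldest pending postback has waited


def snapshot_queue(enqueued_at: Sequence[float],
                   now: Optional[float] = None) -> QueueSnapshot:
    """Summarise pending postbacks from their enqueue timestamps (epoch seconds)."""
    now = time.time() if now is None else now
    if not enqueued_at:
        return QueueSnapshot(depth=0, oldest_age_seconds=0.0)
    return QueueSnapshot(depth=len(enqueued_at),
                         oldest_age_seconds=now - min(enqueued_at))


def log_snapshot(snapshot: QueueSnapshot) -> str:
    """Emit one structured (JSON) log line per sample and return it."""
    line = json.dumps(asdict(snapshot), sort_keys=True)
    logger.info(line)
    return line
```

Queue depth shows how much work is waiting, while the age of the oldest item shows whether the queue is actually draining; a stalled queue can have a modest depth but a steadily growing oldest age.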
Timeline of Events:
- 10:30 CEST, July 11th: Postback service slowdown begins due to queue overload, resulting in timeouts and database locks.
- 10:30 - 12:45 CEST, July 11th: Investigation and troubleshooting take place.
- 12:45 CEST, July 11th: Postback system reboot is performed, allowing the system to process the backlog.
- Post-reboot: A remaining problem queue is identified and purged, and customer postback services are restored via GET call updates.
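The final recovery step, refreshing statuses via GET calls for postbacks that were never delivered, can be sketched as a simple reconciliation loop. This is an illustration only: the report does not state which endpoint was called or how statuses are stored, so `fetch_status` (a stand-in for the GET call) and the `local_statuses` mapping are hypothetical.

```python
from typing import Callable, Dict, Iterable


def reconcile_statuses(transaction_ids: Iterable[str],
                       fetch_status: Callable[[str], str],
                       local_statuses: Dict[str, str]) -> Dict[str, str]:
    """Refresh the locally stored status of each transaction.

    fetch_status stands in for a GET call that returns the current status.
    Returns only the entries whose status actually changed.
    """
    changed: Dict[str, str] = {}
    for transaction_id in transaction_ids:
        current = fetch_status(transaction_id)
        if local_statuses.get(transaction_id) != current:
            local_statuses[transaction_id] = current
            changed[transaction_id] = current
    return changed
```

Passing the fetch function in as a parameter keeps the loop independent of any particular HTTP client and makes it easy to exercise with a stub.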
Corrective Actions:
- Improved Logging: Extend logging to capture internal postback queue metrics.
- Queueing Mechanism Review: Review the queueing mechanism for its ability to handle peak loads and process postbacks efficiently.
- Permanent Fixes: Implement permanent fixes based on the findings of this RCA.
Preventative Measures:
- Proactive Monitoring: Implement proactive monitoring of queue health using the enhanced logging data to identify and address potential overload situations before they impact service delivery.
- Regular System Reviews: Conduct periodic reviews of the postback system to assess performance and identify areas for further improvement.
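Proactive monitoring of queue health could take the form of a threshold check on sampled queue metrics. The sketch below is an assumption-laden example rather than a prescribed design: the metric names and the default threshold values are hypothetical and would need tuning against real postback traffic.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueThresholds:
    max_depth: int = 1000                  # assumed limit, not from the RCA
    max_oldest_age_seconds: float = 300.0  # assumed limit, not from the RCA


def evaluate_queue_health(depth: int,
                          oldest_age_seconds: float,
                          thresholds: QueueThresholds = QueueThresholds()) -> List[str]:
    """Return alert messages; an empty list means no threshold was exceeded."""
    alerts: List[str] = []
    if depth > thresholds.max_depth:
        alerts.append(f"queue depth {depth} exceeds {thresholds.max_depth}")
    if oldest_age_seconds > thresholds.max_oldest_age_seconds:
        alerts.append(
            f"oldest postback has waited {oldest_age_seconds:.0f}s, "
            f"limit is {thresholds.max_oldest_age_seconds:.0f}s"
        )
    return alerts
```

For example, `evaluate_queue_health(depth=1500, oldest_age_seconds=20.0)` returns a single alert about queue depth, which a scheduler could forward to an on-call channel before customers notice delayed postbacks.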
Conclusion: The postback service slowdown was caused by queue overload, which led to timeouts and database locks. By improving logging, reviewing the queueing mechanism, and implementing permanent fixes, Signhost can reduce the likelihood of similar incidents in the future. Proactive monitoring and regular system reviews will further support the stability and efficiency of the postback service.