Timeouts in postbacks
Incident Report for Signhost Verified Signing
Postmortem

RCA: Signhost Postback Service Slowdown - July 11th, 2024

Date: July 12th, 2024

Summary: On July 11th, 2024, Signhost's postback service experienced slowdowns starting at approximately 10:30 CEST. This resulted in timeouts and locks on the database, preventing postbacks from being sent. The issue was resolved for most customers by rebooting the postback system at 12:45 CEST, which allowed the system to work through the backlog and resume real-time postback sending. A remaining problem queue was identified and purged, and customer postback services were fully restored after updating statuses via GET calls.

Root Cause: The RCA identified the following root cause for the postback service slowdown:

  • Queue Overload: The postback system encountered an overload of tasks, leading to timeouts and database locks. This overload could be due to a surge in postback activity or inefficiencies in the current queueing mechanism.

Contributing Factors:

  • Limited Logging: The current logging system does not capture internal postback queue metrics effectively, making it difficult to proactively identify queue overload situations.
  • Queueing Mechanism Limitations: The current queueing mechanism might not be sufficient for handling peak loads or may not be optimized for efficient processing of postbacks.

Timeline of Events:

  • 10:30 CEST, July 11th: Postback service slowdown begins due to queue overload, resulting in timeouts and database locks.
  • 10:30 CEST - 12:45 CEST, July 11th: Investigation and troubleshooting efforts are undertaken.
  • 12:45 CEST, July 11th: Postback system reboot is performed, allowing the system to process the backlog.
  • Post-reboot: A remaining problem queue is identified and purged, and customer postback services are restored via GET call updates.
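
The status catch-up in the last step can be illustrated with a minimal sketch. It is not Signhost's actual tooling: `fetch_status` is a hypothetical callable standing in for whichever GET call returns a transaction's current status, and the status strings are placeholders.

```python
from typing import Callable, Dict, Iterable


def reconcile_statuses(
    transaction_ids: Iterable[str],
    fetch_status: Callable[[str], str],
    local_statuses: Dict[str, str],
) -> Dict[str, str]:
    """Update local_statuses in place and return only the entries that changed."""
    changed: Dict[str, str] = {}
    for tx_id in transaction_ids:
        # Hypothetical wrapper around a GET call for one transaction.
        remote = fetch_status(tx_id)
        if local_statuses.get(tx_id) != remote:
            local_statuses[tx_id] = remote
            changed[tx_id] = remote
    return changed
```

Passing the fetch function in as a parameter keeps the comparison logic independent of any particular HTTP client or endpoint, so the same loop can be run against a stub before it is pointed at a live API.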

Corrective Actions:

  • Improve Logging: Enhance logging practices to capture detailed metrics on internal postback queues, enabling better identification of potential overload scenarios.
  • Review Queueing Mechanism: Evaluate the current queueing mechanism and implement improvements for better handling of peak loads and efficient postback processing. This might involve exploring alternative queueing solutions.
  • Implement Permanent Fixes: Develop and implement permanent solutions to address queue overload issues, such as:

    • Optimizing the current queueing mechanism for postback processing.
    • Migrating to a more robust queueing system if necessary.
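
As an illustration of the kind of queue metrics the logging improvement refers to, the sketch below wraps an in-memory queue and reports its depth and the age of its oldest item. The class name, metric names, and logger name are assumptions made for this example; the report does not describe how the actual postback queue is implemented.

```python
import logging
import time
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

logger = logging.getLogger("postback.queue")  # hypothetical logger name


class InstrumentedQueue:
    """FIFO queue that records enqueue times so backlog age can be measured."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._items: Deque[Tuple[float, str]] = deque()
        self._clock = clock

    def put(self, payload: str) -> None:
        self._items.append((self._clock(), payload))

    def get(self) -> Optional[str]:
        if not self._items:
            return None
        _, payload = self._items.popleft()
        return payload

    def snapshot(self) -> Dict[str, float]:
        """Return the current depth and the age of the oldest queued item."""
        now = self._clock()
        oldest_age = now - self._items[0][0] if self._items else 0.0
        return {"depth": len(self._items), "oldest_age_seconds": oldest_age}

    def log_snapshot(self) -> Dict[str, float]:
        metrics = self.snapshot()
        logger.info(
            "queue depth=%d oldest_age=%.1fs",
            metrics["depth"],
            metrics["oldest_age_seconds"],
        )
        return metrics
```

The age of the oldest item is often a more direct overload signal than depth alone, because it keeps growing while a queue is stalled even if no new items arrive.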

Preventative Measures:

  • Proactive Monitoring: Implement proactive monitoring of queue health using the enhanced logging data to identify and address potential overload situations before they impact service delivery.
  • Regular System Reviews: Conduct periodic reviews of the postback system to assess performance and identify areas for further improvement.
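
A proactive check of this kind could look roughly like the sketch below, assuming queue depth and the age of the oldest queued item are available as metrics. The threshold values are illustrative placeholders, not figures from this incident.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QueueThresholds:
    max_depth: int = 1000                 # assumed value, tune per system
    max_oldest_age_seconds: float = 60.0  # assumed value, tune per system


def evaluate_queue_health(
    depth: int,
    oldest_age_seconds: float,
    thresholds: QueueThresholds = QueueThresholds(),
) -> List[str]:
    """Return one alert message per breached threshold; empty list means healthy."""
    alerts: List[str] = []
    if depth > thresholds.max_depth:
        alerts.append(f"queue depth {depth} exceeds {thresholds.max_depth}")
    if oldest_age_seconds > thresholds.max_oldest_age_seconds:
        alerts.append(
            f"oldest item age {oldest_age_seconds:.0f}s exceeds "
            f"{thresholds.max_oldest_age_seconds:.0f}s"
        )
    return alerts
```

Running such a check on a schedule and alerting on a non-empty result would surface a growing backlog before postback delays become visible to customers.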

Conclusion: The postback service slowdown was caused by queue overload, leading to timeouts and database locks. By improving logging, reviewing the queueing mechanism, and implementing permanent fixes, Signhost can prevent similar incidents in the future. Proactive monitoring and regular system reviews will further ensure the stability and efficiency of the postback service.

Posted Jul 12, 2024 - 16:30 CEST

Resolved
This incident has been resolved.
Posted Jul 11, 2024 - 18:28 CEST
Monitoring
The postback delays are resolved and we see postbacks being delivered in real time again.
Posted Jul 11, 2024 - 12:59 CEST
Investigating
We are seeing timeouts in sending postbacks that lead to a delay in the delivery of postbacks.
We are investigating the issue.
Posted Jul 11, 2024 - 12:15 CEST
This incident affected: API and Portal.