Issues creating and viewing transactions

Incident Report for Signhost Verified Signing

Postmortem

What happened?
Yesterday, March 13, some of our customers experienced issues with transactions failing to complete and occasional unreliability in our Portal when accepting user input. We understand how frustrating this can be, and we sincerely apologize for any inconvenience caused.

What we did
Our team immediately began investigating by analyzing logs, webserver telemetry, and other platform metrics. At 13:09 CET, we identified and restarted a web application node that was not functioning correctly. This initially resolved the errors. However, we continued monitoring and noticed other issues: in particular, the number of open web connections kept growing instead of dropping back as expected. To ensure stability, we removed the problematic node from our load balancer, which restored normal platform operations.

What caused the issue?
The issue stemmed from a release we deployed on Thursday, March 13, at 12:28 CET. One webserver node began experiencing intermittent database connection failures, which led to the errors observed. These failures persisted over time and caused further complications. Further investigation showed that our routine application restart process, which runs on every reboot and release, did not cleanly close and reopen all connections. As a result, the number of open connections kept rising and eventually destabilized our platform.
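To illustrate the kind of behavior described above, here is a minimal sketch in Python of a restart routine that closes every tracked connection before opening new ones. It is not our actual implementation: the ConnectionPool class, its method names, and the use of in-memory SQLite as a stand-in for the real database are all assumptions made for this example.

```python
import sqlite3


class ConnectionPool:
    """Hypothetical pool that tracks every connection it opens."""

    def __init__(self, size):
        self.size = size
        self.connections = []

    def open_all(self):
        # Open `size` fresh connections. In-memory SQLite is only a
        # stand-in here for the real database.
        for _ in range(self.size):
            self.connections.append(sqlite3.connect(":memory:"))

    def close_all(self):
        # Close every tracked connection, continuing even if one close fails.
        while self.connections:
            conn = self.connections.pop()
            try:
                conn.close()
            except sqlite3.Error:
                pass

    def restart(self):
        # Close first, then reopen, so repeated restarts never leave
        # more than `size` connections open.
        self.close_all()
        self.open_all()


pool = ConnectionPool(size=3)
pool.open_all()
pool.restart()
pool.restart()
open_count = len(pool.connections)
```

If restart() skipped the close_all() step, each restart in this sketch would add three more connections on top of the existing ones, which is the kind of steady growth described above.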

What are we doing next?
To prevent similar issues and ensure uninterrupted service, we are taking the following steps:

  • Enhanced monitoring: Improving our system health monitoring to detect issues faster and reduce resolution times.
  • Service provider collaboration: Working with our service provider to address the root cause of the database connection errors.
  • Platform improvements: We identified some core improvements in how our services reconnect upon reboot. These should ensure that continuous releases and reboots keep working as expected: with failover to our multiple backup nodes and without any platform impact.
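
As an illustration of the monitoring step above, the following Python sketch flags a connection count that keeps rising across consecutive samples. The function name, the window size, and the sample values are hypothetical and chosen only for this example; they do not describe our actual monitoring setup.

```python
def is_steadily_rising(samples, window=5):
    """Return True if the last `window` samples never decrease
    and the newest one is higher than the oldest one."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    never_decreases = all(b >= a for a, b in zip(recent, recent[1:]))
    return never_decreases and recent[-1] > recent[0]


# Hypothetical connection counts sampled at a fixed interval.
healthy = [40, 42, 39, 41, 40, 38]
leaking = [40, 44, 51, 57, 66, 72]

healthy_alert = is_steadily_rising(healthy)
leaking_alert = is_steadily_rising(leaking)
```

A check like this would alert on the second series but not the first, since a healthy node's connection count fluctuates rather than only going up.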

These measures are part of our ongoing commitment to providing a reliable, high-availability platform. We are dedicated to delivering a seamless signing service 24/7 and will continue to prioritize improvements to meet your expectations. We follow a continuous release process that allows us to release multiple times a week, and even multiple times a day. We are committed to keeping this process fast and flawless, so that we can enhance and secure our platform without any scheduled, let alone accidental, downtime. Thank you for your patience and understanding as we work to make our platform even better.

Posted Mar 14, 2025 - 15:14 CET

Resolved

This incident has been resolved.
Posted Mar 13, 2025 - 14:58 CET

Monitoring

A fix is implemented and we're monitoring the results.
Posted Mar 13, 2025 - 13:42 CET

Investigating

We are currently investigating this issue.
Posted Mar 13, 2025 - 13:21 CET
This incident affected: API and Portal.