Issue creating transactions

Incident Report for Signhost Verified Signing

Postmortem

What happened? 

On June 23, 2025, between 7:40 PM and 9:40 PM, we experienced an interruption in our API service. Message queues began growing rapidly, indicating that incoming requests were not being handled promptly. Upon investigation, we determined that two of our containerized application servers were responding extremely slowly or had become effectively unresponsive. The queued messages were eventually reprocessed, but with significant delays.

What we did 

Due to limitations in our monitoring, it took some time before the issue was flagged. However, once the first alerts came in, our team quickly began troubleshooting by analyzing logs, reviewing queue metrics, and closely examining the performance of the affected servers. During the investigation, it became apparent that memory issues in the two affected containerized servers were leading to degraded performance. 

To mitigate the issue, we rebooted the affected servers and performed a rollback to our non-containerized platform. This restored normal performance and allowed the system to begin processing the backlog of queued messages. By 9:40 PM, all services had returned to full functionality, and all delayed messages had been successfully processed.

What caused the issue? 

The performance degradation was traced back to memory management issues in two Docker containers. These memory issues arose during our recent migration to a containerized environment. During this process, resource allocation parameters were not fully optimized for production workloads, leading to memory pressure that caused the containers to slow down significantly.

Additionally, our monitoring systems did not flag the incident early enough because their thresholds did not capture the early signs of degradation. This delay extended the time required to resolve the issue.
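
To illustrate the kind of early sign a threshold can capture, here is a minimal sketch in Python of a check that flags a queue whose depth keeps growing across consecutive samples. It is an illustration only, not our actual monitoring configuration: the function name `queue_is_backing_up`, the `min_growth` and `window` defaults, and the sample values are all hypothetical.

```python
from typing import Sequence


def queue_is_backing_up(
    depths: Sequence[int],
    min_growth: int = 100,  # hypothetical: minimum growth per interval, in messages
    window: int = 3,        # hypothetical: consecutive intervals that must all grow
) -> bool:
    """Return True if the queue depth grew by at least `min_growth`
    in each of the last `window` sampling intervals."""
    if len(depths) < window + 1:
        # Not enough samples yet to evaluate a full window.
        return False
    recent = depths[-(window + 1):]
    # Difference between each pair of consecutive samples.
    deltas = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return all(delta >= min_growth for delta in deltas)
```

A check on the growth rate, rather than on absolute queue depth alone, can raise an alert while the backlog is still small. With the hypothetical defaults, `queue_is_backing_up([10, 150, 400, 900])` flags the queue, while a series that dips in between does not.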

What are we doing next? 

To prevent similar issues from happening again and to ensure service continuity, we are taking the following steps: 

  1. Roll Back to Non-Containerized Platform (Temporary Measure): To stabilize services in the short term, we have reverted to our non-containerized platform until we have identified and resolved the underlying memory issues within our containerized environment. 
  2. Fix Memory Issues in Containerized Environment: We are reviewing and optimizing resource allocation parameters, particularly memory limits, to ensure the infrastructure can handle production workloads reliably without degradation. 
  3. Enhance Monitoring Systems: 

    1. We will recalibrate monitoring thresholds to detect early signs of resource strain, enabling proactive intervention before any impact on performance occurs. 
    2. To support this, we are transitioning to a more robust monitoring system, which will provide improved insights, faster anomaly detection, and more precise alerting across our infrastructure. 
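
As a rough illustration of recalibrated thresholds for resource strain, the sketch below classifies a container's memory usage against its configured limit using a warning tier and a critical tier. It is written in Python under stated assumptions: the function name `memory_alert_level` and the 75% and 90% ratios are hypothetical values chosen for the example, not the thresholds we run in production.

```python
def memory_alert_level(
    used_bytes: int,
    limit_bytes: int,
    warn_ratio: float = 0.75,      # hypothetical early-warning threshold
    critical_ratio: float = 0.90,  # hypothetical critical threshold
) -> str:
    """Classify memory usage relative to the container's memory limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = used_bytes / limit_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

The warning tier exists to leave time for intervention before a container reaches its limit; a single high threshold tends to fire only once performance has already degraded.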

Moving to containerized architecture is a critical part of our broader cloud migration strategy. Our ultimate vision is to fully transition to Entrust’s EU hosting infrastructure. This migration will allow us to leverage high-availability capabilities and failover mechanisms to ensure service continuity. Additionally, it will provide a scalable and resilient platform that can meet evolving customer needs while adhering to strict regional compliance requirements to ensure that data resides in designated jurisdictions.

Posted Jun 24, 2025 - 18:08 CEST

Resolved

We are experiencing an issue with creating transactions in our portal and API.
Posted Jun 23, 2025 - 21:30 CEST