On June 23, 2025, between 7:40 PM and 9:40 PM, we experienced an interruption in our API service. Message queues began growing rapidly, indicating that incoming requests were not being handled promptly. Our investigation found that two of our containerized application servers were responding extremely slowly or had become effectively unresponsive. The queued messages were eventually reprocessed, but with significant delays.
Due to limitations in our monitoring, it took some time before the issue was flagged. However, once the first alerts came in, our team quickly began troubleshooting by analyzing logs, reviewing queue metrics, and closely examining the performance of the affected servers. During the investigation, it became apparent that memory issues in the two affected containerized servers were leading to degraded performance.
To mitigate the issue, the affected servers were rebooted, and we rolled back to our non-containerized platform. This action restored normal performance and allowed the system to begin processing the backlog of queued messages. By 9:40 PM, all services had returned to full functionality, and all delayed messages had been successfully processed.
The performance degradation was traced back to memory management issues in two Docker containers. These issues stem from our recent migration to a containerized environment. During this migration, resource allocation parameters were not fully optimized for production workloads, leading to memory pressure that caused the containers to slow down significantly.
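As an illustration only, memory pressure of the kind described above can be expressed as the ratio of a container's memory usage to its configured limit. The sketch below is not our production tooling: the function name `memory_pressure` and the 80% and 95% cut-offs are assumptions chosen for the example.

```python
def memory_pressure(usage_bytes, limit_bytes, warn_ratio=0.80, critical_ratio=0.95):
    """Classify container memory pressure from usage and limit.

    The ratios are illustrative assumptions, not values from our platform.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = usage_bytes / limit_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

A container using 850 MB of a 1,000 MB limit would be classified as "warning" under these assumed cut-offs, well before it reaches the limit itself.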
Additionally, our monitoring systems did not flag the incident early enough due to thresholds that did not effectively capture the early signs of degradation. This delay extended the time required to resolve the issue.
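One way to catch early signs of degradation is to alert on how fast a queue is growing rather than only on its absolute depth. The following is a minimal sketch under stated assumptions: the class name `QueueGrowthMonitor`, the window size, and the growth limit are hypothetical and do not describe our actual monitoring configuration.

```python
from collections import deque


class QueueGrowthMonitor:
    """Flags sustained queue growth across a sliding window of depth samples."""

    def __init__(self, window=5, max_growth_per_sample=50.0):
        if window < 2:
            raise ValueError("window must be at least 2")
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)
        self.max_growth_per_sample = max_growth_per_sample

    def observe(self, depth):
        """Record a queue depth sample; return True if growth exceeds the limit."""
        self.samples.append(depth)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history yet
        # Average change in depth per sample across the window.
        growth = (self.samples[-1] - self.samples[0]) / (len(self.samples) - 1)
        return growth > self.max_growth_per_sample
```

With a check of this kind, a queue that is still shallow but growing quickly can raise an alert before a fixed depth threshold would.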
To prevent similar issues from happening again and to ensure service continuity, we are taking the following steps:
Enhance Monitoring Systems:
Moving to a containerized architecture is a critical part of our broader cloud migration strategy. Our ultimate vision is to fully transition to Entrust’s EU hosting infrastructure. This migration will allow us to leverage high-availability capabilities and failover mechanisms to ensure service continuity. It will also provide a scalable and resilient platform that can meet evolving customer needs while adhering to strict regional compliance requirements, so that data resides in designated jurisdictions.