On September 2 and 3, 2025, our platform experienced a major service outage that significantly impacted availability and transaction processing. The disruption was caused by a code deployment that introduced a bug resulting in excessive duplicate message processing. This overwhelmed our database connections and led to downtime across both days.
On September 2, we deployed a change and observed an outage shortly afterward. The team rolled back the change to restore functionality. After reviewing the change, we re-deployed it later that day, believing the outage was unrelated. However, on September 3, a second major outage occurred around the same time. During this outage, all SQL queries across the Signhost services failed due to database saturation.
After analyzing message logs, we concluded that the code change was the root cause. The deployment intermittently triggered bursts of hundreds of thousands of duplicate messages for the same transaction events. These spikes occurred after hours of normal operation, making the issue difficult to detect early. Once this was confirmed, we permanently rolled back to the previous stable version. The system has remained stable since.
The root cause was a still-unidentified bug in the change we deployed on September 2 that sporadically generated massive volumes of duplicate messages. These bursts overwhelmed the database, causing complete service outages. The duplicates did not come from an infinite loop; they arrived in sudden, high-volume spikes.
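To illustrate the kind of signal that distinguishes such a burst from normal traffic, here is a minimal Python sketch of a sliding-window duplicate detector. It is not the code running on our platform. The class name `DuplicateBurstDetector`, the event key format, and the threshold and window values are all hypothetical, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import defaultdict, deque


class DuplicateBurstDetector:
    """Flags an event key seen more than `threshold` times within `window_seconds`.

    Hypothetical helper for illustration only; names and limits are assumptions.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        # Maps each event key to the timestamps of its recent messages.
        self._seen = defaultdict(deque)

    def observe(self, event_key: str, timestamp: float) -> bool:
        """Record one message and return True if the key is currently in a burst."""
        times = self._seen[event_key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.threshold


# Example: five messages for the same (hypothetical) transaction event in five seconds.
detector = DuplicateBurstDetector(threshold=3, window_seconds=60.0)
alerts = [detector.observe("txn-123:signed", t) for t in (0, 1, 2, 3, 4)]
```

In this example the fourth and fifth messages are flagged, because more than three messages for the same key arrive within the 60-second window. A production version would also need to evict idle keys to bound memory, which this sketch omits.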
To prevent similar incidents and improve our deployment safety for these kinds of changes, we are taking the following steps:
Deployment safeguards:
Monitoring improvements:
Root cause analysis: