Last week we experienced a major outage on our signing platform. Transaction creation in both the portal and the API was delayed. We want to apologize for any problems our customers experienced during this outage, and we want to explain here the steps we have taken to prevent it from recurring. We are committed to maintaining the high uptime you are used to from our platform.
Intro
Last week the transaction processing times of our signing service rose rapidly. We activated an emergency protocol to stop the queue from growing further. As a result, creating new transactions in both the portal and the API was not possible for about 2 hours.
Problem
On February 4th, around 13:30 CET, we encountered a problem with our database that prevented new transactions from being processed. An emergency protocol was activated automatically to prevent timeouts and a large queue buildup. This resulted in ‘not available’ messages and error QR codes, so that customers were at least made aware immediately that transaction creation was not possible.
All transactions that were queued before the emergency protocol was activated were parked, to be re-entered into our regular signing service later.
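For readers who want a more concrete picture, the behaviour described above can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not our production code: the names `TransactionGate`, `ServiceUnavailable`, `activate_emergency` and `restore` are hypothetical and chosen only for this example.

```python
from collections import deque


class ServiceUnavailable(Exception):
    """Raised when new transactions are refused during emergency mode."""


class TransactionGate:
    """Hypothetical gate in front of the transaction queue."""

    def __init__(self):
        self.queue = deque()    # transactions waiting for regular processing
        self.parked = []        # transactions set aside during an incident
        self.emergency = False  # True while the emergency protocol is active

    def submit(self, tx_id):
        # During an emergency, fail fast instead of letting requests time out.
        if self.emergency:
            raise ServiceUnavailable("transaction creation is not available")
        self.queue.append(tx_id)

    def activate_emergency(self):
        # Refuse new work and park everything that was already queued.
        self.emergency = True
        self.parked.extend(self.queue)
        self.queue.clear()

    def restore(self):
        # Accept new work again and re-enter parked transactions in order.
        self.emergency = False
        self.queue.extend(self.parked)
        self.parked.clear()
```

In this sketch, failing fast means a caller gets a clear 'not available' answer right away, while already queued transactions keep their original order when they are re-entered.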
A database error like this should be quick to diagnose and fix. The compounding problem was that we had migrated to a new server environment some time earlier. In this new environment we had not been granted all the rights we were used to in our previous environment, so we had to depend on our web hosting party to fully diagnose and fix the problem. Because of this extra layer of communication, a problem we should have been able to diagnose and fix right away took longer to solve than expected.
Fix
After we got in contact with our hosting party, the problem was diagnosed and fixed. Around 15:32 CET our portal and API functionality was restored. Around midnight the next day (Friday) we restored the previously parked transactions. As a result, some transactions may show as processed around midnight on Friday, even though they were actually created or interacted with earlier.
Mitigation
To mitigate these problems in the future, we have verified that the rights we have on our server environment allow us to respond to and repair issues on our platform quickly. Furthermore, we have expanded our logging and testing capabilities so that we can catch similar problems at an earlier stage. Finally, we are in the process of retooling some database services to prevent and reduce bottlenecks.
This makes sure that:
This will help us maintain our high uptime and keep our customers’ environments operational and responsive in the future.