Last week we saw a malfunction in the mail service of our signing platform. Email invitations for signing were occasionally sent out without content. We apologize for any problems our customers experienced during this outage, and with this post we want to inform you about the steps we took to prevent it from recurring. We are committed to guaranteeing the high uptime you are used to from our platform.
Intro
On 14 June we saw errors appearing in our email log. We activated an emergency protocol to prevent more of these mails from going out, but as a result some transactions whose mails were sent during exactly this window were unable to continue. For about 60 minutes, new transaction invitation mails were queued for later sending, and some in-progress transactions with mails and sign operations during this queueing period were unable to continue.
Problem
On 14 June around 16:33 CET we encountered errors in our email log after releasing a regular script version update to our email functionality. Even though this update had been tested both automatically and manually in our staging environment, the release still caused issues, with email being queued. Normally this would not be a problem: we release new functionality all the time, and with our continuous release policy we can easily roll back so that customer signing is not impacted.
However, after the immediate rollback we still did not see the desired behaviour. We had to do a re-rollback to fully restore functionality. This meant that what should have been a brief issue was prolonged and therefore impacted customers' signing flows. For around 60 minutes customers could encounter issues, such as:
To find out whether you were impacted, a first indication is a transaction that remains ‘in progress’. You can check whether that transaction was actually affected by looking at its transaction details page for mails sent between 16:33 and 17:30 CET. Creating a new transaction will resolve the problem, and you can always contact us to have us diagnose your transaction in detail.
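If you keep your own record of transaction statuses and mail timestamps, the check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the status string, and the input shape are hypothetical and not part of our platform's API, timestamps are assumed to be in the same local time as the window quoted above, and the incident date (including its year) must be supplied by the caller.

```python
from datetime import date, datetime, time

# Window boundaries as quoted in this post (same local time as the details page).
WINDOW_START = time(16, 33)
WINDOW_END = time(17, 30)


def mail_in_incident_window(sent_at: datetime, incident_date: date) -> bool:
    """Return True if a mail timestamp falls on the incident date, inside the window."""
    return (
        sent_at.date() == incident_date
        and WINDOW_START <= sent_at.time() <= WINDOW_END
    )


def transaction_possibly_impacted(
    status: str, mail_times: list[datetime], incident_date: date
) -> bool:
    """Flag a transaction that is still 'in progress' and has a mail in the window.

    The status string and the list of mail timestamps are hypothetical inputs,
    for example copied by hand from the transaction details page.
    """
    return status == "in progress" and any(
        mail_in_incident_window(t, incident_date) for t in mail_times
    )
```

A transaction flagged this way is only a candidate; the transaction details page, or a diagnosis by our support team, remains the authoritative check.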
Fix
After the re-rollback the problem was fixed. We had to do some diagnosis first to make sure the re-rollback would work, which made the issue take longer to fix than we would have liked.
Mitigation
To mitigate these problems in the future, we have expanded the tests in our automated test scripts and updated our rollback protocol to take more variables and data into account, a need that this issue brought to light.
This makes sure that:
This will help us guarantee our high uptime and keep our customers' environments operational and responsive in the future.