Signhost Incident Postmortem - April 26, 2024
Impact:
- Users were unable to create transactions or experienced slow transaction creation between 11:15 CEST and 12:33 CEST.
- Signhost staff put the platform in maintenance mode to prevent overloading during the incident.
Root Cause:
An issue with a storage node, identified by our hosting provider, CloudVPS, caused problems with the storage of our main database instance. Our hosting provider is currently migrating data away from this malfunctioning node. These migrations caused incidents on our platform on April 22nd and today, April 26th.
Incident Timeline:
- 11:09 CEST: Users began experiencing issues creating transactions on Signhost. Creating a transaction was either impossible or significantly slower than usual.
- 11:15 CEST: Signhost staff were alerted to the problem and started investigating.
- 11:30 CEST: We reached out to our hosting provider. Initially no clear problem was found.
- Between 11:15 CEST and 12:33 CEST: Signhost staff put the platform in maintenance mode on multiple occasions to prevent the system from overloading due to the storage issue.
- 12:33 CEST: The load subsided and the platform was deemed operational again.
- 13:00 CEST: The issue was identified as being related to the node migration at our hosting provider.
- 14:23 CEST: We received final confirmation from CloudVPS that the data migration had finished, and we stopped monitoring the issue.
Resolution:
The issue was resolved once CloudVPS finished its maintenance and migration work.
Prevention:
Our hosting provider's development team is working on a new NVMe (Non-Volatile Memory Express) implementation for storage. This new implementation is expected to offer improved performance and reliability, so that emergency node migrations of the sort we’ve seen twice this week should no longer be needed. In addition, we have made agreements with our hosting provider to be informed more clearly about any emergency migrations, so that we know the root cause of a problem and can act sooner.
Next Steps:
- Signhost will continue to monitor the situation and communicate any updates to users.
- We will work with our hosting provider to ensure this NVMe implementation is rolled out as soon as possible, and that we are informed about any planned maintenance more directly.
- We will review our own incident response procedures with our hosting partner to ensure a more efficient response in the future.
We apologize for any inconvenience this incident may have caused.