Issues with transaction creation

Incident Report for Signhost Verified Signing

Postmortem

Signhost Incident Postmortem - April 26, 2024

Impact:

Users were unable to create transactions or experienced slow transaction creation between 11:15 CEST and 12:33 CEST.
Signhost staff put the platform in maintenance mode to prevent overloading during the incident.

Root Cause:

An issue with a storage node, identified by our hosting provider, CloudVPS, caused problems with the storage of our main database instance. Our hosting provider is currently migrating data away from this malfunctioning node. These migrations caused incidents on our platform on April 22nd and today, April 26th.

Incident Timeline:

11:09 CEST: Users began experiencing issues creating transactions on Signhost. Transaction creation either failed or was significantly slower than usual.
11:15 CEST: Signhost staff were alerted to the problem and started investigating.
11:30 CEST: We reached out to our hosting provider. Initially, no clear problem was found.
Between 11:15 CEST and 12:33 CEST: Signhost staff put the platform in maintenance mode on multiple occasions to prevent the system from overloading due to the storage issue.
12:33 CEST: The load subsided and the platform was deemed operational again.
13:00 CEST: The issue was identified as being related to the node migration at our hosting provider.
14:23 CEST: We received final confirmation from CloudVPS that the data migration had finished, and we stopped monitoring the issue.

Resolution:

The issue was resolved when CloudVPS finished its maintenance and migration work.

Prevention:

Our hosting provider's development team is working on a new NVMe (Non-Volatile Memory Express) implementation for storage. This new implementation is expected to offer improved performance and reliability, so that emergency node migrations of the sort we have seen twice this week should no longer be needed. In addition, we have made agreements to be informed more clearly about any emergency migrations, so that we know the root cause of a problem and can act sooner.

Next Steps:

Signhost will continue to monitor the situation and communicate any updates to users.
We will work with our hosting provider to ensure this NVMe implementation is rolled out in the short term, and that we are informed of any planned maintenance more directly.
We will review our own incident response procedures with our hosting partner to ensure a more efficient response in the future.

We apologize for any inconvenience this incident may have caused.

Posted Apr 26, 2024 - 17:55 CEST

Resolved

This problem has been resolved. We will post a postmortem once we have received all information from our hosting provider.

Posted Apr 26, 2024 - 14:42 CEST

Monitoring

The service is stable for now. We are monitoring the situation with our hosting provider.

Posted Apr 26, 2024 - 12:33 CEST

Investigating

After restoring the services, we see the load peaking again. The problem is not solved, and we are investigating with the highest urgency.

Posted Apr 26, 2024 - 12:11 CEST

Update

We are continuing to monitor for any further issues.

Posted Apr 26, 2024 - 12:04 CEST

Monitoring

The load has subsided and our system is active again. We are monitoring the situation.

Posted Apr 26, 2024 - 12:03 CEST

Identified

We see a very high peak load and have put our systems in maintenance until this subsides. We are in close contact with our hosting provider and expect to be operational as soon as possible.

Posted Apr 26, 2024 - 11:49 CEST

Investigating

We are currently experiencing issues with creating transactions, which result in 500 errors.
We are investigating the issue.

Posted Apr 26, 2024 - 11:19 CEST

This incident affected: API and Portal.