On March 29th we had an incident in which some of our services did not reboot correctly after server downtime during the night: https://status.signhost.com/incidents/mhc18lwx58cz
After resolving that issue, we still saw unexpected behaviour in some services on the afternoon of the following day, March 30th. We found the underlying issue and rebooted our services. We reboot our redundant services one by one so that this has no impact on ongoing transactions. After this reboot, at 17:00 CET, we suddenly saw errors appear. Transaction creation was still possible, but in a small percentage of cases it resulted in retryable 500 errors. After 30 minutes we decided to block transaction creation to prevent further errors from queueing up and to allow us to fully diagnose the cause of the issue.
Around 18:15 we found the cause of the errors: a client had broken down because of the reboot and, in this specific scenario, escaped our automatic monitoring and logging. Full functionality was restored immediately, and we added more logging so that we can identify this problem if it occurs again. Furthermore, we improved our reboot behaviour so that clients keep behaving as expected after a reboot following downtime.
Between 17:30 and 18:30 transaction creation was not possible. We have taken steps to prevent such downtime in the future.