On Thursday, September 26, starting around 10:25 CEST, users were unable to create transactions because of an issue with a database node.
As soon as we noticed the issue, we began investigating the root cause and contacted our hosting provider so they could check their logs as well.
At the same time, we put our platform in maintenance mode to prevent overload while we continued investigating.
Our hosting provider discovered a problem with one of the database servers. We moved the first database to a different server, which improved the situation. After bringing our platform back online, however, we saw the delays return and immediately put the platform back into maintenance mode.
We then decided to migrate the second database node as well. However, our hosting provider ran into issues when attempting this move. As a workaround, we switched the commits database to the first server. After this change, the platform stabilized.
We then gradually brought the servers back online, and after about half an hour we were fully operational again.
Our hosting provider identified high traffic on the server where our database node was running, which led to performance problems. To fix this, we tried moving our database servers to different hardware. Moving the first server improved the situation somewhat, but we ran into problems when trying to move the second. Our hosting provider found an issue with this second server that prevented the migration. This caused the delays and the downtime of the platform.
We will work to gain better insights into the status of our database servers, including monitoring their performance and load more effectively. This will help us detect issues earlier and take corrective action faster.
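As a rough illustration of the kind of load monitoring described above, the sketch below raises an alert when the rolling average load of a database node stays above a threshold. It is not our actual tooling: the `LoadMonitor` class, the 0.8 threshold, the window size, and the sample values are all hypothetical and chosen only for the example.

```python
from collections import deque
from statistics import mean


class LoadMonitor:
    """Tracks recent load samples for one database node (hypothetical helper)."""

    def __init__(self, threshold: float, window: int = 5):
        self.threshold = threshold            # assumed alert level, e.g. 0.8 = 80% load
        self.samples = deque(maxlen=window)   # keeps only the most recent `window` samples

    def record(self, load: float) -> bool:
        """Add a sample; return True when the rolling average exceeds the threshold."""
        self.samples.append(load)
        # Only alert once the window is full, so a single early spike is ignored.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and mean(self.samples) > self.threshold


# Made-up load samples for one node, rising over time.
monitor = LoadMonitor(threshold=0.8, window=3)
alerts = [monitor.record(load) for load in [0.5, 0.7, 0.9, 0.95, 0.99]]
```

Averaging over a short window rather than alerting on single samples is one possible design choice: it trades a slightly later alert for fewer false alarms from brief spikes.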
We will keep in close contact with our hosting partner to prevent this from happening again.