Summary
Starting on September 14 at around 15:00 UTC+2, the flow of events on our platform slowed down considerably. As a result, the platform sometimes did not accept new transactions and users were shown error messages. After 10 minutes this behaviour went away again. It recurred over the next few days, each time at around the same time (15:00 UTC+2). We diagnosed and solved the problem in close contact with our hosting party.
Problem
Several times in a row, hard disk activity on our database spiked without a clear cause. We started our analysis right away, but because the spikes occurred only intermittently, lasting a few minutes before going away again, the cause was hard to pin down. Searching our own processes and logging did not reveal it directly.
We contacted our hosting party with all the logging and information we had gathered. During the investigation on September 23rd, we also switched off non-essential platform services, such as user creation, to confirm that the behaviour was caused by the area we suspected. This investigation showed that some configuration on the hosting party's end had been reset to default values, so our hard disk behaviour was no longer optimized as it had been before. Restoring this configuration initially caused some additional downtime, because an emergency reboot was needed.
Solution
On September 23rd we restored the configuration together with our hosting party. This solved the issues.
Mitigation
We made clearer agreements with our hosting party about the specific configuration in use, so that situations like this, with load spiking at specific moments, do not occur again. Furthermore, we increased our disk I/O capacity to leave more headroom when high hard disk load does occur.
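
As an illustration only, and not a description of the tooling used during this incident, a recurring time-of-day pattern like the 15:00 spikes described above could be surfaced by grouping spike timestamps by hour of day. The sketch below is in Python; the function name, the `min_days` threshold and the sample timestamps are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def recurring_spike_hours(spike_times, min_days=3):
    """Return the hours of the day (0-23) in which a spike was seen
    on at least `min_days` distinct calendar dates."""
    days_by_hour = defaultdict(set)
    for ts in spike_times:
        # Track which dates had a spike in each hour of the day.
        days_by_hour[ts.hour].add(ts.date())
    return sorted(
        hour for hour, days in days_by_hour.items() if len(days) >= min_days
    )


# Made-up sample data, not real incident measurements.
spikes = [
    datetime(2000, 1, 1, 15, 2),
    datetime(2000, 1, 2, 9, 40),   # one-off spike at another time
    datetime(2000, 1, 2, 15, 5),
    datetime(2000, 1, 3, 15, 1),
]

print(recurring_spike_hours(spikes))  # prints [15]
```

A result that keeps pointing at the same hour suggests a scheduled cause, such as a periodic job or a configuration change, rather than random load.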