Partial system outage
Incident Report for Evidos trust services
Postmortem

Over the past few days we saw delays in our API and queuing processes. Transaction creation could take around 30 seconds or time out. We apologize for any problems our customers experienced during this outage, and with this report we want to inform you about the steps we have taken to prevent it from recurring. We are committed to guaranteeing the high uptime you are used to from our platform.

Problem

In the past week our platform experienced high load three times, each time lasting around 30 minutes from start to end. During these periods transactions were created after a delay, and on some occasions transaction creation resulted in timeout messages for our customers in both the portal and the API.

We have been investigating these issues closely, as they were similar in nature and resulted in degraded performance on our systems. We found that a certain service rebooted just before each of these issues occurred, and that on reboot it put a very high load on our systems.

Fix

First, we will make sure that this service reboots in a controlled fashion. This gives us granular control over database load, so we expect our system to remain operational.

Secondly, we are introducing more granular control over this service, so that when a reboot occurs the load can more easily be spread out, and so that we can more easily identify any further bottlenecks in subservices.
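
As an illustration only, the idea of spreading a reboot out over time can be sketched as follows. This is not our actual implementation: the function name, the parameters and the jitter approach are all assumptions made for the example. It computes a start delay for each subservice so that they do not all hit the database at the same moment.

```python
import random


def staggered_start_delays(worker_count, window_seconds, jitter_fraction=0.2, rng=None):
    """Return one start delay (in seconds) per worker, spread across a time window.

    Hypothetical helper: each worker gets its own slot in the window, plus a
    small random jitter so that workers do not start in lockstep.
    """
    if worker_count <= 0:
        return []
    rng = rng or random.Random()
    slot = window_seconds / worker_count  # time reserved for each worker
    delays = []
    for index in range(worker_count):
        # Jitter is a fraction of one slot, in either direction.
        jitter = rng.uniform(-jitter_fraction, jitter_fraction) * slot
        # Never schedule a start in the past.
        delays.append(max(0.0, index * slot + jitter))
    return delays


if __name__ == "__main__":
    # Example: start 5 subservices over a 60 second window.
    for index, delay in enumerate(staggered_start_delays(5, 60)):
        print(f"subservice {index} starts after {delay:.1f}s")
```

A longer window or a smaller number of workers per window lowers the peak load on the database, at the cost of a slower full restart.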

These fixes have been partly introduced and will be rolled out further today.

Mitigation

As communicated before, we are in the midst of a migration to a new hosting party and to new hosting technology. This migration is expected to be completed by the end of this summer. The new hosting technology will enable us to handle events and services in such a way that these load issues no longer occur. A big part of our migration plan, and the reason we are migrating to new technology, is to further streamline our database and queuing processes, so we are taking these findings and fixes into account.

This will help us guarantee our high uptime and keep our customers’ environments operational and responsive in the future.

Posted Jul 07, 2021 - 10:30 CEST

Resolved
The signing service is operational again. We are zooming in on the cause of this issue and suspect a web service that slows down transaction creation when it reboots. Fixing this problem is our first priority, and we will update this incident when a full report and root-cause fix are available.
Posted Jul 06, 2021 - 16:13 CEST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 06, 2021 - 15:48 CEST
Investigating
Creating transactions is not possible at the moment. Please bear with us while we try to resolve this incident as soon as possible.
Posted Jul 06, 2021 - 15:12 CEST
This incident affected: API and Portal.