Major incident

January 31, 2025

Today we had a major incident, with the application not available for some hours.

We received the first report of the application not working at around 08:00 UTC, and we were able to restore the service by 14:06 UTC; this makes about six hours during which the application was not available.

We are not really happy with this issue; we have investigated it, and it seems the issue was caused by a service not being able to properly restart after a security update was applied to the underlying infrastructure. We have collected the available evidence, and we’ll try to replicate the problem in a test environment to understand how to change the configuration to prevent this from happening again.

Reliability

We had a few minor issues lately (with the application not responding for brief periods of time), but the previous major incident we had was in … December 2017!

Since we cannot account for the minor issues we had, computing a reliability percentage does not make much sense, but we are still pretty proud of the perceived reliability of our service.

But this doesn’t mean we don’t strive to do even better. 😅

Feedback

One nice side effect that repeats each time we have this kind of issue is the many messages we receive from our users. They were obviously reporting the issue, but most of them also included some praise for our service.

And this means so much to us, <kidding> we are starting to consider causing periodic incidents, just to receive some feedback from our users! 🤪</kidding>

Handling failures

The current best antidote to these service disruptions is to use the Offline Copy feature to keep a local copy of all your data, but we know it is not super convenient to use. For this reason, we are currently working on a completely new feature (Device Sync) for the new /epsilon version. This new feature is still not super reliable, so we are not yet confident enough to release it for public use; but if you are brave enough, you can already try it out.