Major (reliability) incident

January 31, 2025

Today we had a major (reliability) incident, with the application not available for some hours.

We received the first report of the application not working at around 08:00 UTC, and we were able to restore the service by 14:06 UTC; this makes it roughly six hours during which the application was not available.

We are not really happy with this issue; we have investigated it, and it seems it was caused by a service that could not restart properly after a security update was applied to the underlying infrastructure. We have collected the available evidence, and we’ll try to replicate the problem in a test environment to understand how to change the configuration so that this does not happen again.

Reliability

We have had a few minor issues lately (with the application not responding for brief periods of time), but the previous major incident we had was in … December 2017!

Since we cannot account for the minor issues we had, computing a reliability percentage does not make much sense, but we are still pretty proud of the perceived reliability of our service.

But this doesn’t mean we don’t strive to do even better. 😅

Feedback

One nice side effect that repeats each time we have this kind of issue is the many messages we receive from our users. They were obviously reporting the issue, but most of them also included some praise for our service.

And this means so much to us, <kidding> we are starting to consider causing periodic incidents, just to receive some feedback from our users! 🤪</kidding>

Handling failures

The current best antidote to these service disruptions is to use the Offline Copy feature to keep a local copy of all your data, but we know it is not very convenient to use. For this reason, we are currently working on a completely new feature (Device Sync) for the new /epsilon version. This new feature is still not super reliable, so we are not yet confident enough to release it for public use; but if you are brave enough, you can already try it out.

Edit

On Monday, February 24th, we updated the title and the description (but not the slug, to avoid breaking references) to include the "(reliability)" note. Reading the blog post again after a while, we realized that “incident” could also refer to security/integrity issues, which we definitely did not have to address.

clipperz

keep it to yourself
