There were several hours (about 5) of downtime this morning. We’re back now.
The problems don’t appear to have been serious and were simple to fix. The downtime lasted so long only because I was asleep and the texts to my cellphone didn’t wake me up.
Although this hasn’t been a problem until now, it’s certainly not the type of error that should be allowed to repeat itself. Here are some of the plans to prevent this from happening again:
- Remove the false positives: For about a week, the system has been sending pages in the middle of the night for non-urgent things. Those unneeded alerts will be cleaned up so that I stop getting used to unimportant pages.
- Auto-recovery: This morning’s fix was pretty simple, and there are a few types of problems that could be fixed automatically. If I have some time today, I’ll work on a system that looks at a bunch of info across the servers, decides whether a problem is easy to fix automatically, and tries to fix it if I haven’t responded to the first two pages (i.e., after 15 minutes of downtime).
- Ringtone change: The reason ambulances switch up their sirens is that people notice when things are different. The false positives got me used to sleeping through the old tone, so I’ve changed it to another (conveniently, far more annoying) tone.
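
To make the auto-recovery idea a bit more concrete, here is a rough sketch of the decision it describes: only attempt a fix when nobody has acknowledged the first two pages, about 15 minutes of downtime have passed, and the problem is one with a known automatic fix. The two thresholds come from the plan above; everything else (`Incident`, `should_auto_recover`, `maybe_auto_recover`, and the `fixers` mapping) is a hypothetical name made up for illustration, not the real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Thresholds taken from the plan above.
PAGES_BEFORE_AUTOFIX = 2              # the first two pages went unanswered
DOWNTIME_BEFORE_AUTOFIX_S = 15 * 60   # roughly 15 minutes of downtime


@dataclass
class Incident:
    """Hypothetical snapshot of an ongoing problem."""
    problem: str         # label for the detected problem type
    downtime_s: float    # seconds since the problem started
    pages_sent: int      # pages sent to the on-call person so far
    acknowledged: bool   # True once a human has responded


def should_auto_recover(incident: Incident,
                        fixers: Dict[str, Callable[[], bool]]) -> bool:
    """True only when no human responded and a known fix exists."""
    return (
        not incident.acknowledged
        and incident.pages_sent >= PAGES_BEFORE_AUTOFIX
        and incident.downtime_s >= DOWNTIME_BEFORE_AUTOFIX_S
        and incident.problem in fixers
    )


def maybe_auto_recover(incident: Incident,
                       fixers: Dict[str, Callable[[], bool]]) -> Optional[bool]:
    """Run the matching fixer if allowed.

    Returns None when no attempt was made, otherwise whatever the
    fixer reported (True for success, False for failure).
    """
    if not should_auto_recover(incident, fixers):
        return None
    return fixers[incident.problem]()
```

Keeping the list of fixable problems as an explicit mapping means anything unrecognized is left alone and keeps paging a human, which seems like the safer default for a system that acts while I’m asleep.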
Hopefully, we’ll be back to uptime-nirvana again.