So, in the wake of last week’s problems on LyricWiki, the system was finally purring along again at full steam when, this morning at about 1am, our data center had its first-ever full power outage. This is the kind of thing that isn’t supposed to happen, since good data centers (like LyricWiki’s) have power from multiple sources, plus generators that kick on if all of those fail. But once in a while a human error will mess the whole thing up and briefly cut the power. The power was only out for about a minute, which basically just causes the computers to reboot.
This normally would have been fine… the servers are configured so that they are all supposed to jump right back in where they left off. Once all of them are up, the site should automatically work again. “Should”. For some reason, the master-slave replication is broken again. The slave is trying to read a position in the master’s log file that doesn’t exist. So yet again, I think we’re going to have to start the replication all over from scratch. This means about an hour or more of downtime.
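For the curious, here is a rough sketch of what “a position that doesn’t exist” means. It assumes the replication works off log-file coordinates (a file name plus a byte offset), and the function name, log names, and sizes are all made up for illustration; this is not the actual check our servers run.

```python
from typing import Dict


def replica_position_is_valid(master_logs: Dict[str, int],
                              log_file: str,
                              log_pos: int) -> bool:
    """Return True if (log_file, log_pos) points inside a log the master has.

    master_logs maps each log file name on the master to its size in bytes.
    All names and numbers passed in here are hypothetical examples.
    """
    size = master_logs.get(log_file)
    if size is None:
        # The master has no log file by that name at all.
        return False
    # A position past the end of the file points at data that was never
    # written, e.g. because the power cut out before it reached the disk.
    return 0 <= log_pos <= size


# Hypothetical situation after the reboot: the master's log ends at byte
# 1000, but the slave remembers having read up to byte 1500.
master_logs = {"binlog.000042": 1000}
print(replica_position_is_valid(master_logs, "binlog.000042", 1500))
```

When a check like this fails, the slave has no safe place to resume from, which is why the replication has to be started over from a fresh copy of the master’s data.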
For now, I just disabled the slave (to prevent weird behavior), so the site is going to be working with far fewer resources than it normally has. I’m planning to schedule the downtime for the replication fix at night, when traffic is lower. The site will potentially be slower than normal today since it doesn’t have the slave server helping out.
All in all, it could have been a <strong>lot</strong> worse, since the downtime so far has only been a minute or so, but please be aware that there will potentially be an hour or two of downtime sometime tonight or tomorrow morning.
Posted by Sean Colombo