When it rains it pours :P

June 3, 2008

So, in the wake of last week’s problems on LyricWiki, the system was finally purring along at full steam again when, at about 1 a.m. this morning, our data center had its first-ever full power outage.  This is the kind of thing that isn’t supposed to happen, since good data centers (like LyricWiki’s) have power from multiple sources, plus generators that kick in if all of those fail.  But once in a while a human error will mess the whole thing up and briefly cut the power.  The power was only out for about a minute, which basically just causes the computers to reboot.

This normally would have been fine… the servers are configured so that they are all supposed to pick up right where they left off.  Once all of them are up, the site should automatically work again.  “Should”.  For some reason, the master-slave replication is broken again.  The slave is trying to read from a position in the master’s log file that doesn’t exist.  So yet again, I think we’re going to have to start the replication over from scratch.  This means about an hour or more of downtime.
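For anyone who wants to catch this kind of breakage automatically, here is a minimal sketch (not what LyricWiki actually runs) that parses the vertical output of MySQL’s `SHOW SLAVE STATUS\G` and checks whether both replication threads are running.  The function names and the sample text are made up for illustration; `Slave_IO_Running` and `Slave_SQL_Running` are the real status fields, and how you fetch the output (mysql client, cron job, etc.) is left to you.

```python
def parse_slave_status(text):
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Skip the "**** 1. row ****" separator and lines without a colon.
        if sep and not key.strip().startswith("*"):
            status[key.strip()] = value.strip()
    return status


def replication_healthy(status):
    """True only when both replication threads report 'Yes'."""
    return (status.get("Slave_IO_Running") == "Yes"
            and status.get("Slave_SQL_Running") == "Yes")


# Illustrative sample only (not a real log from our servers).
SAMPLE = """\
*************************** 1. row ***************************
        Slave_IO_Running: No
       Slave_SQL_Running: Yes
              Last_Error: (placeholder error text)
"""

if __name__ == "__main__":
    status = parse_slave_status(SAMPLE)
    if not replication_healthy(status):
        print("Replication is broken; time to take a look.")
```

If the check fails, that is the cue to disable the slave or page someone before users start seeing stale data.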

For now, I’ve just disabled the slave (to prevent weird behavior), so the site is working with far fewer resources than it normally has.  I’m planning to schedule the downtime at night, when traffic is lower.  The site may be slower than normal today since it doesn’t have the slave server helping out.

All in all, it could have been a <strong>lot</strong> worse, since the downtime so far has only been a minute or so, but please be aware that there will potentially be an hour or two of downtime some time tonight or tomorrow morning.


Don’t get pwned like we did!

June 1, 2008

Earlier this week we had some really strange errors on the site (detailed description here) that were caused by a MySQL bug which will eventually burn anyone who is running master/slave replication.

I mentioned that you’d need to write a script to prevent this problem from happening to you. The script would delete old, unneeded MySQL bin-log files. In the spirit of stopping this bug from affecting others, I’ve released a generic version of the script… you just have to set a few variables in the “configuration” section at the top of the script and you should be good to go. For the code and info on how to make it run daily, see the “update” section of the post on my blog. It is, of course, free and open source.
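To give a rough idea of the approach, here is a small sketch.  It is not the released script, just an illustration under a couple of assumptions: that bin-log files are named like `mysql-bin.000012`, and that you already know which file the slave is currently reading (for example from `Master_Log_File` in `SHOW SLAVE STATUS`).  The helper names are hypothetical.  It works out which files are older than the slave’s current one and builds the matching `PURGE BINARY LOGS TO` statement, which removes every bin-log before the named file.

```python
import re

# Assumed naming scheme: <base>.<sequence number>, e.g. mysql-bin.000012
BINLOG_NAME = re.compile(r"^(?P<base>[\w.-]+)\.(?P<seq>\d+)$")


def purgeable_binlogs(binlog_files, slave_current_file):
    """Return the bin-log files strictly older than the slave's current one."""
    current = BINLOG_NAME.match(slave_current_file)
    if current is None:
        raise ValueError("unexpected bin-log name: %r" % slave_current_file)
    cutoff = int(current.group("seq"))
    old = []
    for name in binlog_files:
        m = BINLOG_NAME.match(name)
        # Ignore non-matching files such as the .index file.
        if m and m.group("base") == current.group("base") \
                and int(m.group("seq")) < cutoff:
            old.append((int(m.group("seq")), name))
    return [name for _, name in sorted(old)]


def purge_statement(slave_current_file):
    """Build the SQL that deletes every bin-log before the given file."""
    if BINLOG_NAME.match(slave_current_file) is None:
        raise ValueError("unexpected bin-log name: %r" % slave_current_file)
    return "PURGE BINARY LOGS TO '%s';" % slave_current_file


if __name__ == "__main__":
    files = ["mysql-bin.000010", "mysql-bin.000011",
             "mysql-bin.000012", "mysql-bin.index"]
    print(purgeable_binlogs(files, "mysql-bin.000012"))
    print(purge_statement("mysql-bin.000012"))
```

With more than one slave, you would want to use the oldest file that any of them is still reading, so nothing gets purged out from under a slave that has fallen behind.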

Hope that helps someone!