I mentioned before that I could explain the weirdness that’s been going on for the last couple of days on the site. If you’re curious, here are the technical details:
We Use Master-Slave Replication
On LyricWiki, one of the tricks we use for scaling is called master-slave replication (that’s a how-to, not an explanation). The crux of the concept is that you can have one database (the master) which holds “the truth” and which you can read from and write to. The other server (the slave) is just for reading – it only knows what’s going on because the master gives it regular (almost instant) updates. Since the slave knows that it only has to listen to the master for updates, it can hold onto the data in a way that makes it REALLY fast for reading. There are several things you can do on LyricWiki that only require reading. For example, if you’re logged-out and are simply viewing a page, that’s a bunch of read-requests. We send 80% of the reads to the slave, and the master handles all of the writes and the other 20% of reads.
What went wrong – Hard-drive space
The other night, the site started behaving weirdly. We started getting errors saying that there were too many connections. This should be rare at this point, and is generally fixed by restarting the database server (ie: kicking everyone out for a second and only serving those who come back). The errors kept happening though. To make a long story short, it became clear that the reason is that we ran out of hard-drive space on the server (the error messages weren’t too helpful in finding this). The reason we ran out of space is that the master and slave mentioned above communicate through a log-file. Each time a change happens on the master, it writes that change to a log-file which the slave is keeping an eye on.
The problem is that there are a TON of things happening on LyricWiki. Since the last time the logs were cleared (mid-march), we filled up 100gigs of hard-drive space with these log-files! The solution should have been just to move some of the old files to another hard-drive and restart the system. That’s what I did and that’s when things started getting weird.
Enter the mySQL bug
After the fix, things were taking a while to get back to normal. Then it became clear that they weren’t heading back to normal at all. Edits to pages weren’t taking effect, histories and diffs were skipping things… everything was a mess. After some digging, it became clear that the master-slave replication mentioned above was completely split. This means that the slave remembered the site the way it was a couple of days ago and wasn’t getting updates from the master. To make things even more confusing for users, the fact that 80% of the reads go to the master means that users would see the wrong pages 4/5 times and the correct (up-to-date) page 1/5 times. That’s just a really good way to drive someone bonkers :).
As it turns out, there is a known bug in mysql which will mess up the log file that the master and slave use to communicate in the event that you run out of hard-drive space. Yikes! This makes your slave completely and hopelessly lost. You have to delete it and start over. This means you need to have at least a brief outage on your site. The more data you have, the longer your outage.
1. mySQL – You’re still my homeboy
That’s a really bad bug for mySQL to have laying around, but I don’t condemn mySQL as a system because of it. I’ve been absolutely abusing that database system for about 6 years and this is the first time I found any errors to be mySQL’s fault. We often hear frustrated programmers saying their must be a bug in their underlying technology, and they are almost always wrong. It is usually the programmer’s fault. I’m as guilty as the next guy (esp. early on in my coding career before I started to realize the embarrassing pattern). Long story short: I still view mySQL as one of the most robust systems I work with. Much love. They need to fix this bug though, it will seriously bite every major site once it has to scale.
2. Keep this from happening to you
If you are using master-slave replication, learn from my mistakes and do a bit of preventative-work, it’ll be faster than mopping up after this mess. Write a daily cron-job to find out what log-file you are currently using on the slave and delete all of the older logs. If you have extra hard-drives, feel free to move the old logs and store them forever, but they’re probably just too big and unnecessary to keep around for very long.
3. I need monitoring!
I’d say this got fixed relatively smoothly considering how messed up things were… I checked disk space really early on just on a hunch, but I really had no hints from error-logs so this could have dragged on much longer. It was still about 2 days of unacceptably poor behavior for a site. About 8 hours of that was wasted before I found out. I’ve got to get a heart-beat script to run on another server and watch to make sure LyricWiki is always working and shoot me an email and/or text-message when it goes down. Then I’ll always be on it. I’ve written about a dewtrillion of these (that’s a lot because it’s caffeinated), I don’t know why I haven’t made one for LyricWiki yet.
Well, there you have it. Way more technical info on the problem than you could have ever wanted! All technical issues aside, I’m sorry it went down again. LyricWiki is not my full-time gig, so admittedly it’s a little sloppy for something used for millions of requests per day. Some day I may be able to devote more time to keep the site humming better, but until then I’ll just have to make better use of my time and be smarter about what I choose to do in the time I have (like write that monitor script!).
Even though it was a mySQL bug that caused it, they’re my servers and there is no excuse. My bad.