The promised details about the problems from earlier.

May 29, 2008

I mentioned before that I would explain the weirdness that’s been going on for the last couple of days on the site.  If you’re curious, here are the technical details:

We Use Master-Slave Replication

On LyricWiki, one of the tricks we use for scaling is called master-slave replication (that’s a how-to, not an explanation).  The crux of the concept is that you have one database (the master) which holds “the truth” and which you can read from and write to.  The other server (the slave) is just for reading; it only knows what’s going on because the master gives it regular (almost instant) updates.  Since the slave knows that it only has to listen to the master for updates, it can hold onto the data in a way that makes it REALLY fast for reading. There are several things you can do on LyricWiki that only require reading. For example, if you’re logged out and simply viewing a page, that’s a bunch of read requests. We send 80% of the reads to the slave, and the master handles all of the writes plus the other 20% of the reads.
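If that’s hard to picture, here’s a tiny sketch of the read/write split in Python. Everything in it (the host names, the helper function, the exact 80/20 ratio) is a made-up placeholder to show the idea, not LyricWiki’s actual code:

```python
import random

# Hypothetical hosts -- not the real LyricWiki servers.
MASTER_HOST = "db-master.example.com"
SLAVE_HOST = "db-slave.example.com"
SLAVE_READ_RATIO = 0.8  # roughly 80% of reads go to the slave

def pick_db_host(is_write):
    """Writes always go to the master; most reads go to the slave."""
    if is_write:
        return MASTER_HOST
    return SLAVE_HOST if random.random() < SLAVE_READ_RATIO else MASTER_HOST

# A logged-out page view is all reads, so most of its queries land on the slave:
print(pick_db_host(is_write=False))  # usually the slave
print(pick_db_host(is_write=True))   # always the master
```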

What went wrong – Hard-drive space

The other night, the site started behaving weirdly. We started getting errors saying that there were too many connections. That should be rare at this point, and is generally fixed by restarting the database server (i.e., kicking everyone out for a second and only serving those who come back). The errors kept happening, though. To make a long story short, it became clear that we had run out of hard-drive space on the server (the error messages weren’t much help in figuring that out). The reason we ran out of space is that the master and slave mentioned above communicate through a log file: each time a change happens on the master, it writes that change to a log file, which the slave keeps an eye on.
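(Side note for the curious: if you want to see how much room those logs are taking up on your own server, something along these lines works. It assumes the mysql command-line client is on the PATH and can log in with your usual credentials; SHOW BINARY LOGS is the real statement that lists the log files and their sizes.)

```python
import subprocess

# List the binary logs and add up their sizes.
out = subprocess.run(
    ["mysql", "-e", "SHOW BINARY LOGS"],
    capture_output=True, text=True, check=True,
).stdout

total_bytes = 0
for line in out.splitlines()[1:]:        # skip the header row
    name, size = line.split()[:2]
    total_bytes += int(size)

print("binary logs are using %.1f GB" % (total_bytes / 1e9))
```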

The problem is that there are a TON of things happening on LyricWiki. Since the last time the logs were cleared (mid-March), we had filled up 100 gigs of hard-drive space with these log files! The solution should have been as simple as moving some of the old files to another hard drive and restarting the system. That’s what I did, and that’s when things started getting weird.

Enter the MySQL bug

After the fix, things were taking a while to get back to normal. Then it became clear that they weren’t heading back to normal at all. Edits to pages weren’t taking effect, histories and diffs were skipping things… everything was a mess. After some digging, it became clear that the master-slave replication mentioned above was completely split. This means that the slave remembered the site the way it was a couple of days ago and wasn’t getting updates from the master. To make things even more confusing for users, the fact that 80% of the reads go to the slave meant that users would see the stale page 4 out of 5 times and the correct (up-to-date) page 1 out of 5 times. That’s just a really good way to drive someone bonkers :).

As it turns out, there is a known bug in MySQL that will corrupt the log file the master and slave use to communicate if you run out of hard-drive space. Yikes! That leaves your slave completely and hopelessly lost: you have to delete it and start over, which means at least a brief outage on your site. The more data you have, the longer your outage.
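For anyone stuck in the same spot: the usual way to rebuild a dead slave is to take a dump from the master that records the binlog coordinates, load it onto the slave, and let it start replicating again from that position. Here’s a rough sketch of those steps (the host names and paths are placeholders, and it glosses over credentials and the read-only window you’d want while the dump runs):

```python
import subprocess

# 1. On the master: dump everything and record the binlog coordinates in the dump
#    (--master-data=1 writes a CHANGE MASTER TO statement into the file).
subprocess.run(
    "mysqldump --all-databases --master-data=1 --single-transaction > /backups/reseed.sql",
    shell=True, check=True,
)

# 2. Copy the dump to the slave (placeholder host name) and load it there.
subprocess.run("scp /backups/reseed.sql slave-host:/tmp/reseed.sql", shell=True, check=True)
subprocess.run("ssh slave-host 'mysql < /tmp/reseed.sql'", shell=True, check=True)

# 3. On the slave: start replicating from the position recorded in the dump.
subprocess.run('ssh slave-host "mysql -e \'START SLAVE\'"', shell=True, check=True)
```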

Postmortem

1. MySQL – You’re still my homeboy

That’s a really bad bug for MySQL to have lying around, but I don’t condemn MySQL as a system because of it. I’ve been absolutely abusing that database system for about 6 years, and this is the first time I’ve found any errors to be MySQL’s fault. We often hear frustrated programmers saying there must be a bug in their underlying technology, and they are almost always wrong; it is usually the programmer’s fault. I’m as guilty as the next guy (especially early in my coding career, before I started to notice the embarrassing pattern). Long story short: I still view MySQL as one of the most robust systems I work with. Much love. They need to fix this bug, though; it will seriously bite every major site once it has to scale.

2. Keep this from happening to you

If you are using master-slave replication, learn from my mistakes and do a bit of preventive work; it’ll be faster than mopping up after this mess. Write a daily cron job that finds out which log file the slave is currently reading and deletes all of the older ones. If you have extra hard drives, feel free to move the old logs somewhere and keep them forever, but they’re probably just too big and unnecessary to hang onto for very long.
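Something like this is what I mean. It’s only a sketch with placeholder host names and no real error handling, but SHOW SLAVE STATUS and PURGE BINARY LOGS TO are the actual MySQL statements involved:

```python
import subprocess

# Placeholder connection settings -- point these at your own slave and master.
SLAVE = ["mysql", "--host", "slave-host"]
MASTER = ["mysql", "--host", "master-host"]

# 1. Ask the slave which of the master's binlog files it is still reading
#    (the Relay_Master_Log_File field of SHOW SLAVE STATUS).
status = subprocess.run(
    SLAVE + ["-e", r"SHOW SLAVE STATUS\G"],
    capture_output=True, text=True, check=True,
).stdout

current_file = None
for line in status.splitlines():
    if "Relay_Master_Log_File:" in line:
        current_file = line.split(":", 1)[1].strip()

# 2. Tell the master it can throw away every log older than that file.
if current_file:
    subprocess.run(
        MASTER + ["-e", "PURGE BINARY LOGS TO '%s'" % current_file],
        check=True,
    )
```

(MySQL also has an expire_logs_days setting you can drop into my.cnf so old binlogs age out automatically; that covers the simple case without any cron job at all.)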

3. I need monitoring!

I’d say this got fixed relatively smoothly considering how messed up things were… I checked disk space really early on, just on a hunch, but I really had no hints from the error logs, so this could have dragged on much longer. It was still about 2 days of unacceptably poor behavior for a site, and about 8 hours of that were wasted before I even found out there was a problem. I’ve got to get a heartbeat script running on another server that makes sure LyricWiki is always up and shoots me an email and/or text message when it goes down. Then I’ll always be on it. I’ve written about a dewtrillion of these (that’s a lot because it’s caffeinated); I don’t know why I haven’t made one for LyricWiki yet.
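The script itself is only a few lines. Something like this, run from cron on a separate box, would do it; the URL and addresses are placeholders, and the text-message half just leans on the email-to-SMS gateway most carriers provide:

```python
import smtplib
import urllib.request
from email.mime.text import MIMEText

SITE_URL = "http://lyricwiki.org/"  # page the check fetches
ALERT_TO = ["me@example.com", "5551234567@txt.example.com"]  # email + SMS gateway (placeholders)

def site_is_up():
    """Return True if the site answers with HTTP 200 within 15 seconds."""
    try:
        with urllib.request.urlopen(SITE_URL, timeout=15) as resp:
            return resp.status == 200
    except Exception:
        return False

def send_alert():
    """Email the alert addresses; assumes a local mail daemon on the monitoring box."""
    msg = MIMEText("LyricWiki is not responding!")
    msg["Subject"] = "LyricWiki DOWN"
    msg["From"] = "monitor@example.com"
    msg["To"] = ", ".join(ALERT_TO)
    with smtplib.SMTP("localhost") as server:
        server.sendmail(msg["From"], ALERT_TO, msg.as_string())

if __name__ == "__main__":
    if not site_is_up():
        send_alert()
```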

Conclusion

Well, there you have it. Way more technical info on the problem than you could have ever wanted! All technical issues aside, I’m sorry it went down again. LyricWiki is not my full-time gig, so admittedly it’s a little sloppy for something that serves millions of requests per day. Someday I may be able to devote more time to keeping the site humming, but until then I’ll just have to make better use of my time and be smarter about what I choose to do in the time I have (like writing that monitor script!).

Even though it was a MySQL bug that caused it, they’re my servers and there is no excuse. My bad.


*Whew* Back again

May 28, 2008

We’re back and things appear to be running at 100%.  I’ll explain what happened in a post a little later today, but I have another pressing matter to work on before then.  Sorry for the problems!


Fixing errors.

May 28, 2008

Due to an unfortunate series of events, the replication (the master/slave setup where one server keeps a copy of the data for fast reads) has to be started over.  I’m working on it now.  Once it’s done I’ll update with details on what happened.

I’m going to be taking down the API and possibly the site off-and-on until this is fixed to make it go as quickly as possible.

Sorry :(


Site is confused

May 28, 2008

The site may appear “confused” today.  There is a slave database that keeps a copy of the main database, and a certain percentage of requests read their data from that server (I think it’s about 75%).  Unfortunately, that database isn’t getting updates from the main database at the moment, so a lot of pages will randomly look the way they did a few days ago.

I’m trying to fix this right now.


…and we’re back!

May 26, 2008

The site is running again.

Some more maintenance needs to be done so this doesn’t keep happening (need to write a script to delete old log files).

On top of that, it appears that the site was down for several hours before I noticed it.  Not cool.  I think I’m going to put some work into a monitor script running on another set of servers that will make sure the site is running well and send me a text message when it goes down.  I guess I have some homework!

PS: Happy Memorial Day!


“Too many connections.”

May 26, 2008

It’s a holiday, so the odds that LyricWiki is crashing are much higher!  (I have no idea why that happens except cruel fate :P).

We’ve been getting some “too many connections” errors lately, but today the site wouldn’t restart for some reason.  I’m working on getting it back up; then I will try to find out why we’ve been getting the error.

This isn’t going well at the moment…


Recent server issues

May 13, 2008

You may have noticed an error message along the lines of “Host ‘pedlfaster.pedlr.com’ is blocked because of many connection errors” recently.

We kept fixing that in the short term, but it kept popping back up. Now there is a more permanent fix in place, and I’m looking into what caused it (I’m assuming it was a spike in API traffic).
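(If you run into the same message on your own server: MySQL blocks a host once it racks up too many aborted connections, which is governed by the max_connect_errors setting. Roughly, the short-term fix is to flush the block list and the longer-term fix is to raise that limit. This is just a sketch and assumes the mysql/mysqladmin clients can log in as an admin user:)

```python
import subprocess

# Short-term: clear MySQL's host block list so the blocked machine can connect again.
subprocess.run(["mysqladmin", "flush-hosts"], check=True)

# Longer-term: raise the threshold that triggers the block
# (the default for max_connect_errors is quite low).
subprocess.run(
    ["mysql", "-e", "SET GLOBAL max_connect_errors = 10000"],
    check=True,
)
```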

Regardless: you shouldn’t be seeing that anymore. If you do, harass us immediately!