LyricWiki.org is back up. I’m relatively sure I fixed everything. I’ll be back on to check things in a bit & I’ll have more details once I find out if things are actually working.
The site has been behaving strangely lately and I’m not sure how I managed to break it or what is wrong, but I’m looking into it actively. I’m going to be working on any intensive-changes starting tomorrow (saturday) morning since weekends tend to have less traffic than weekdays.
People have been forwarding me a good bit of info about the problems, and it has been very helpful. If you have seen anything strange (especially if you have noticed a pattern in it), please pass the information along to me!
Hopefully the site will be all patched up by the end of tomorrow. Check the blog for updates as things are happening (brief outages are very likely tomorrow).
I really apologize for the oddities on the site. They are almost certainly my fault (and even if they weren’t… they would be since it’s my responsibility to keep things humming along).
Thanks for your patience,
That was faster than I expected
So, in the wake of last week’s problems on LyricWiki, the system was finally purring again at full-steam when this morning at about 1am, our data center had its first-ever full power-outage. This is the kind of thing that isn’t supposed to happen since good datacenters (like LyricWiki’s) have power from multiple sources and generators that kick on if all of those fail. But once in a while a human error will mess the whole thing up and briefly cut the power. The power was only out for about 1 minute which basically just causes the computers to reboot.
This normally would have been fine… the servers are configured in a way that they are all supposed to jump right in where they left off. Once all of them are up, the site should automatically work again. “Should”. For some reason, the master-slave replication is broken again. The slave is trying to read a position in the log-file that doesn’t exist. So yet again, I think we’re going to have to start the replication all over. This means about an hour or more of downtime.
For now, I just disabled the slave (to prevent weird behavior) so the site is going to be working with a lot less resources than it normally is. I’m planning to do the outage at night when the traffic is lower. The site will potentially be slower than normal today since it doesn’t have the slave-server helping out.
All-in-all it could have been a <strong>lot</strong> worse since the downtime so far has only been a minute or so, but please be aware that there will be potentailly an hour or two of downtime some time tonight/tomorrow-morning.
Earlier this week we had some really strange errors on the site (detailed description here) that were caused by a mySQL bug which will eventually burn anyone who is running master/slave replication.
I mentioned that you’d need to write a script to prevent this problem from happening to you. The script would delete old, unneeded mySQL bin-log files. In the spirit of stopping this bug from affecting others, I’ve released a generic version of the script… you just have to set up a few variables in the “configuration” section at the top of the script and you should be good to go. For the code and info on how to make it run daily, see the “update” section of the post on my blog. It is, of course, free and open source.
Hope that helps someone!
I mentioned before that I could explain the weirdness that’s been going on for the last couple of days on the site. If you’re curious, here are the technical details:
We Use Master-Slave Replication
On LyricWiki, one of the tricks we use for scaling is called master-slave replication (that’s a how-to, not an explanation). The crux of the concept is that you can have one database (the master) which holds “the truth” and which you can read from and write to. The other server (the slave) is just for reading – it only knows what’s going on because the master gives it regular (almost instant) updates. Since the slave knows that it only has to listen to the master for updates, it can hold onto the data in a way that makes it REALLY fast for reading. There are several things you can do on LyricWiki that only require reading. For example, if you’re logged-out and are simply viewing a page, that’s a bunch of read-requests. We send 80% of the reads to the slave, and the master handles all of the writes and the other 20% of reads.
What went wrong – Hard-drive space
The other night, the site started behaving weirdly. We started getting errors saying that there were too many connections. This should be rare at this point, and is generally fixed by restarting the database server (ie: kicking everyone out for a second and only serving those who come back). The errors kept happening though. To make a long story short, it became clear that the reason is that we ran out of hard-drive space on the server (the error messages weren’t too helpful in finding this). The reason we ran out of space is that the master and slave mentioned above communicate through a log-file. Each time a change happens on the master, it writes that change to a log-file which the slave is keeping an eye on.
The problem is that there are a TON of things happening on LyricWiki. Since the last time the logs were cleared (mid-march), we filled up 100gigs of hard-drive space with these log-files! The solution should have been just to move some of the old files to another hard-drive and restart the system. That’s what I did and that’s when things started getting weird.
Enter the mySQL bug
After the fix, things were taking a while to get back to normal. Then it became clear that they weren’t heading back to normal at all. Edits to pages weren’t taking effect, histories and diffs were skipping things… everything was a mess. After some digging, it became clear that the master-slave replication mentioned above was completely split. This means that the slave remembered the site the way it was a couple of days ago and wasn’t getting updates from the master. To make things even more confusing for users, the fact that 80% of the reads go to the master means that users would see the wrong pages 4/5 times and the correct (up-to-date) page 1/5 times. That’s just a really good way to drive someone bonkers :).
As it turns out, there is a known bug in mysql which will mess up the log file that the master and slave use to communicate in the event that you run out of hard-drive space. Yikes! This makes your slave completely and hopelessly lost. You have to delete it and start over. This means you need to have at least a brief outage on your site. The more data you have, the longer your outage.
1. mySQL – You’re still my homeboy
That’s a really bad bug for mySQL to have laying around, but I don’t condemn mySQL as a system because of it. I’ve been absolutely abusing that database system for about 6 years and this is the first time I found any errors to be mySQL’s fault. We often hear frustrated programmers saying their must be a bug in their underlying technology, and they are almost always wrong. It is usually the programmer’s fault. I’m as guilty as the next guy (esp. early on in my coding career before I started to realize the embarrassing pattern). Long story short: I still view mySQL as one of the most robust systems I work with. Much love. They need to fix this bug though, it will seriously bite every major site once it has to scale.
2. Keep this from happening to you
If you are using master-slave replication, learn from my mistakes and do a bit of preventative-work, it’ll be faster than mopping up after this mess. Write a daily cron-job to find out what log-file you are currently using on the slave and delete all of the older logs. If you have extra hard-drives, feel free to move the old logs and store them forever, but they’re probably just too big and unnecessary to keep around for very long.
3. I need monitoring!
I’d say this got fixed relatively smoothly considering how messed up things were… I checked disk space really early on just on a hunch, but I really had no hints from error-logs so this could have dragged on much longer. It was still about 2 days of unacceptably poor behavior for a site. About 8 hours of that was wasted before I found out. I’ve got to get a heart-beat script to run on another server and watch to make sure LyricWiki is always working and shoot me an email and/or text-message when it goes down. Then I’ll always be on it. I’ve written about a dewtrillion of these (that’s a lot because it’s caffeinated), I don’t know why I haven’t made one for LyricWiki yet.
Well, there you have it. Way more technical info on the problem than you could have ever wanted! All technical issues aside, I’m sorry it went down again. LyricWiki is not my full-time gig, so admittedly it’s a little sloppy for something used for millions of requests per day. Some day I may be able to devote more time to keep the site humming better, but until then I’ll just have to make better use of my time and be smarter about what I choose to do in the time I have (like write that monitor script!).
Even though it was a mySQL bug that caused it, they’re my servers and there is no excuse. My bad.
Due to an unfortunate series of events, the replication (master/slave thing where one server gets a copy for fast reading of data) has to be started over. I’m working on it now. Once it’s done I’ll update with details on what happened.
I’m going to be taking down the API and possibly the site off-and-on until this is fixed to make it go as quickly as possible.
The site is running again.
Some more maintenance needs to be done so this doesn’t keep happening (need to write a script to delete old log files).
On top of that, it appears that the site was down for several hours before I noticed it. Not cool. I think I’m going to put some work into a monitor script running on another set of servers which will make sure the site is running well and will send me a txt message when the site goes down. I guess I have some homework!
PS: Happy Memorial Day!
It’s a holiday, so the odds that LyricWiki is crashing are much higher! (I have no idea why that happens except cruel fate :P).
We’ve been getting some “too many connections” errors lately, but today the site wouldn’t restart for some reason. I’m working on getting it back up, then I will try to find out why we’ve been getting the error.
This isn’t going well at the moment…
You may have noticed an error message along the lines of “Host ‘pedlfaster.pedlr.com’ is blocked because of many connection errors” recently.
While we kept fixing that in the short-term, it kept popping back up. Now there is a more permanent fix in there, and I’m looking into what caused it (I’m assuming it was a spike in API traffic).
Regardless: you shouldn’t be seeing that anymore. If you do, harass us immediately!