September 12, 2008
This week, the site has been extremely slow and has even gone down a couple of times. I spent a while searching for a specific problem, but it appears we’ve simply hit the wall on how much traffic our current servers can support. That’s fairly good timing, since we’d been planning to add more servers for a little while, so I’d already begun looking into it.
Today I ordered another server with the same specs as the current Apache server. This will bring us up to 5 total servers running LyricWiki. For the curious (and tech-savvy): that’s one squid caching server in front of two Apache web servers which talk to one mysql master server and one read-only replica mysql server.
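For the extra-curious: a setup like that has Squid running as a reverse proxy ("accelerator") that load-balances across the Apache backends. A minimal sketch of the relevant Squid directives (the hostnames are made up for illustration; our actual config differs):

```
# Squid listens on port 80 in accelerator mode and round-robins
# requests across the two Apache origin servers behind it.
http_port 80 accel defaultsite=lyricwiki.org
cache_peer apache1.example.internal parent 80 0 no-query originserver round-robin name=apache1
cache_peer apache2.example.internal parent 80 0 no-query originserver round-robin name=apache2
```

The win is that Squid serves cached pages itself and only bothers Apache for misses, which is most of why one caching box can sit in front of multiple web servers.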
To get the server to be as beefy as we need, I had to ask the hosting company to order extra RAM for it. So we’re just waiting for that to be delivered (hopefully around this weekend or very soon after) and we’ll be ready to start working to get the new server pulled into our setup.
In addition to just having more machine-power to handle our traffic, this will give us two additional benefits immediately. The first is that we can use the new server to test the upgrade to the newest version of MediaWiki (the software that runs our site as well as Wikipedia). The second benefit is that we’ll now have two Apache servers – currently the most overworked part of the system – with one running the API and one running the site itself (lyricwiki.org). This will let us more quickly identify when something is wrong with one of those two systems, and it makes it unlikely that problems with either of them will affect the other.
Exciting times… stay tuned!
June 3, 2008
So, in the wake of last week’s problems on LyricWiki, the system was finally purring along again at full steam when, at about 1am this morning, our data center had its first-ever full power outage. This is the kind of thing that isn’t supposed to happen, since good datacenters (like LyricWiki’s) have power from multiple sources plus generators that kick on if all of those fail. But once in a while, human error messes the whole thing up and briefly cuts the power. The power was only out for about a minute, which basically just causes the computers to reboot.
This normally would have been fine… the servers are configured so that they should all jump right back in where they left off. Once all of them are up, the site should automatically work again. “Should”. For some reason, the master-slave replication is broken again: the slave is trying to read a position in the binary log that doesn’t exist. So yet again, I think we’re going to have to start the replication over from scratch. That means an hour or more of downtime.
For now, I’ve just disabled the slave (to prevent weird behavior), so the site is running with far fewer resources than usual. I’m planning to do the outage at night, when traffic is lower. The site will potentially be slower than normal today since it doesn’t have the slave server helping out.
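For reference, rebuilding replication from a fresh copy of the master’s data boils down to re-pointing the slave at a valid binary-log position. A sketch of the MySQL commands involved (the log file name and position here are placeholders — they come from running `SHOW MASTER STATUS` on the master at the moment the copy was taken):

```sql
-- On the slave, after restoring a fresh dump of the master's data:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_LOG_FILE = 'mysql-bin.000123',  -- placeholder: from SHOW MASTER STATUS
  MASTER_LOG_POS  = 4;                   -- placeholder position
START SLAVE;

-- Then verify with SHOW SLAVE STATUS\G :
-- Slave_IO_Running and Slave_SQL_Running should both say 'Yes'.
```

The expensive part isn’t these commands, it’s taking the consistent copy of the master’s data, which is where the hour-plus of downtime comes from.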
All in all, it could have been a <strong>lot</strong> worse, since the downtime so far has only been a minute or so, but please be aware that there will potentially be an hour or two of downtime sometime tonight/tomorrow morning.
June 1, 2008
Earlier this week we had some really strange errors on the site (detailed description here) that were caused by a MySQL bug which will eventually burn anyone running master/slave replication.
I mentioned that you’d need to write a script to prevent this problem from happening to you. The script deletes old, unneeded MySQL bin-log files. In the spirit of stopping this bug from affecting others, I’ve released a generic version of the script… you just have to set a few variables in the “configuration” section at the top and you should be good to go. For the code and info on how to make it run daily, see the “update” section of the post on my blog. It is, of course, free and open source.
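The released script is the thing to actually use, but the core idea is simple enough to sketch. This is a rough illustration, not the real script — the directory, file prefix, and retention period are placeholder assumptions you’d adjust for your own setup:

```python
import os
import time

LOG_DIR = "/var/lib/mysql"        # where the bin-logs live (assumption)
LOG_PREFIX = "mysql-bin."         # typical bin-log file name prefix
KEEP_SECONDS = 14 * 24 * 60 * 60  # keep two weeks of logs (arbitrary choice)

def old_binlogs(log_dir=LOG_DIR, prefix=LOG_PREFIX, keep=KEEP_SECONDS, now=None):
    """Return paths of bin-log files in log_dir older than `keep` seconds."""
    if now is None:
        now = time.time()
    stale = []
    for name in sorted(os.listdir(log_dir)):
        # Skip files that aren't bin-logs, and never touch the .index file.
        if not name.startswith(prefix) or name.endswith(".index"):
            continue
        path = os.path.join(log_dir, name)
        if now - os.path.getmtime(path) > keep:
            stale.append(path)
    return stale

if __name__ == "__main__" and os.path.isdir(LOG_DIR):
    for path in old_binlogs():
        print("would delete:", path)  # swap in os.remove(path) once you trust it
```

Run it from cron once a day and the bin-logs can’t pile up far enough to trigger the bug.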
Hope that helps someone!
May 28, 2008
Due to an unfortunate series of events, the replication (the master/slave setup where a second server keeps a copy of the data for fast reads) has to be started over. I’m working on it now. Once it’s done, I’ll update with details on what happened.
I’m going to be taking down the API and possibly the site off-and-on until this is fixed to make it go as quickly as possible.
May 13, 2008
You may have noticed an error message along the lines of “Host ‘pedlfaster.pedlr.com’ is blocked because of many connection errors” recently.
While we kept fixing that in the short-term, it kept popping back up. Now there is a more permanent fix in there, and I’m looking into what caused it (I’m assuming it was a spike in API traffic).
Regardless: you shouldn’t be seeing that anymore. If you do, harass us immediately!
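For anyone hitting the same error on their own MySQL setup: the server blocks a host after it racks up too many aborted connections in a row. Roughly, the short-term fix and the longer-term knob look like this (the value here is illustrative, not what we used):

```sql
-- Short-term: clear the blocked-host cache so the host can connect again.
FLUSH HOSTS;

-- Longer-term: raise the threshold that triggers the block
-- (the default is quite low).
SET GLOBAL max_connect_errors = 10000;
```

Raising the threshold just buys headroom; if something is genuinely spamming broken connections, that still needs to be tracked down.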
March 16, 2008
Once more into the breach!
I’m going to hurl myself at the database-replication problem again. The current plan I have unfortunately involves some downtime for the site. But it’s already after 1am so I’m hoping the impact on users is relatively low.
I’ll keep you updated.
February 25, 2008
If you see “Too many connections.”, it’s because the database got locked up during a backup. The backup should be completed in about 5 minutes. Sorry for the delays! 😦
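For fellow admins: a backup doesn’t necessarily have to hold a lock for its whole duration. If the tables are InnoDB, something along these lines dumps a consistent snapshot without blocking readers (MyISAM tables still need the lock, so this is a sketch, not a universal fix):

```shell
# Dump a consistent snapshot without holding a global read lock
# for the duration (works for InnoDB tables).
mysqldump --single-transaction --quick --all-databases > backup.sql
```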