June 7, 2008
It appears that the problems on the site were due to someone (read: me) messing up when they restarted the database-replication!
Thanks to a bunch of helpful problem-reports from a number of users, I had some good data to look at to figure out what was wrong. It was actually pretty easy to figure out once I had all of those problem-pages to look at (I’m talking about you, Brian May!).
While I was waiting for the computers to move some massive files around, I had a couple of minutes here and there to make other tweaks to the site. Two somewhat interesting things came out of this time: 1) the “job queue” now runs automatically every hour (which keeps things up to date and avoids assigning extra jobs to random users, who would otherwise get something like a 2-minute page load at random about once every 10,000 pages) and 2) the road-block page that shows up when the site is shut down for maintenance now has an iframe in it which shows a Google search of LyricWiki.org for the same page and suggests that users click the “Cached” link. This will allow people to see a somewhat-recent version of most of the pages even while the site is down. I dig it.
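For anyone wanting to try the same road-block trick, here is a minimal Python sketch of building the search link and the iframe markup. It is not the code running on the site: the function names, the iframe dimensions and the example page title are all made up for illustration, and whether a search page will actually render inside a frame depends on the headers the search engine sends.

```python
from html import escape
from urllib.parse import urlencode


def cached_search_url(page_title, site="lyricwiki.org"):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    query = f"site:{site} {page_title}"
    return "https://www.google.com/search?" + urlencode({"q": query})


def roadblock_iframe(page_title):
    """Return iframe markup for a maintenance page (dimensions are arbitrary)."""
    src = escape(cached_search_url(page_title), quote=True)
    return f'<iframe src="{src}" width="100%" height="600"></iframe>'


# Example with a made-up page title.
print(roadblock_iframe("Queen:Bohemian Rhapsody"))
```

The search results are only a starting point; the visitor still has to click the “Cached” link to see the stored copy of the page.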
Special thanks/shoutouts to Kiefer, Redxx, Senvaikis, Teknomunk and WillMak050389 for their help figuring out what was wrong and testing things to make sure they were fixed!
As always, please let me know if you see something strange on the site. Thanks!
June 7, 2008
LyricWiki.org is back up. I’m relatively sure I fixed everything. I’ll be back on to check things in a bit & I’ll have more details once I find out if things are actually working.
June 6, 2008
The site has been behaving strangely lately and I’m not sure how I managed to break it or what is wrong, but I’m looking into it actively. I’m going to be working on any intensive changes starting tomorrow (Saturday) morning since weekends tend to have less traffic than weekdays.
People have been forwarding me a good bit of info about the problems, and it has been very helpful. If you have seen anything strange (especially if you have noticed a pattern in it), please pass the information along to me!
Hopefully the site will be all patched up by the end of tomorrow. Check the blog for updates as things are happening (brief outages are very likely tomorrow).
I really apologize for the oddities on the site. They are almost certainly my fault (and even if they weren’t… they would be since it’s my responsibility to keep things humming along).
Thanks for your patience,
June 3, 2008
That was faster than I expected 🙂
June 3, 2008
Site can’t survive without the slave-server. Fixing now…
June 3, 2008
So, in the wake of last week’s problems on LyricWiki, the system was finally purring again at full-steam when this morning at about 1am, our data center had its first-ever full power-outage. This is the kind of thing that isn’t supposed to happen since good datacenters (like LyricWiki’s) have power from multiple sources and generators that kick on if all of those fail. But once in a while a human error will mess the whole thing up and briefly cut the power. The power was only out for about 1 minute which basically just causes the computers to reboot.
This normally would have been fine… the servers are configured in a way that they are all supposed to jump right in where they left off. Once all of them are up, the site should automatically work again. “Should”. For some reason, the master-slave replication is broken again. The slave is trying to read a position in the log-file that doesn’t exist. So yet again, I think we’re going to have to start the replication all over. This means about an hour or more of downtime.
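To picture what “a position in the log-file that doesn’t exist” means, here is a small Python sketch. It assumes you have copied the master’s log names and sizes (as listed by `SHOW BINARY LOGS`) and the slave’s file and position (from `SHOW SLAVE STATUS`) into plain Python values. The function name and the sample numbers are invented for illustration, and this is only a sanity check, not a repair tool.

```python
def slave_position_is_valid(master_logs, log_file, log_pos):
    """Return True if the slave's (file, position) exists on the master.

    master_logs maps bin-log file names to their size in bytes.
    """
    size = master_logs.get(log_file)
    if size is None:
        # The master no longer has (or never had) this log file.
        return False
    # A position past the end of the file points at data that is not there.
    return 0 <= log_pos <= size


# Made-up sample values.
master_logs = {"mysql-bin.000041": 1048576}
print(slave_position_is_valid(master_logs, "mysql-bin.000041", 2048))
print(slave_position_is_valid(master_logs, "mysql-bin.000041", 2000000))
```

When the check fails, the slave has nothing valid to resume from, which is why replication has to be set up again from a fresh copy of the master’s data.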
For now, I just disabled the slave (to prevent weird behavior) so the site is going to be working with far fewer resources than it normally has. I’m planning to do the outage at night when the traffic is lower. The site will potentially be slower than normal today since it doesn’t have the slave-server helping out.
All-in-all it could have been a lot worse since the downtime so far has only been a minute or so, but please be aware that there will potentially be an hour or two of downtime some time tonight/tomorrow-morning.
June 1, 2008
Earlier this week we had some really strange errors on the site (detailed description here) that were caused by a MySQL bug which will eventually burn anyone who is running master/slave replication.
I mentioned that you’d need to write a script to prevent this problem from happening to you. The script would delete old, unneeded MySQL bin-log files. In the spirit of stopping this bug from affecting others, I’ve released a generic version of the script… you just have to set up a few variables in the “configuration” section at the top of the script and you should be good to go. For the code and info on how to make it run daily, see the “update” section of the post on my blog. It is, of course, free and open source.
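To give a rough idea of what such a cleanup involves, here is a minimal Python sketch. It is not the released script: the function names and sample file names are invented, and it assumes you already know which master log file the slave is currently reading (the `Master_Log_File` field of `SHOW SLAVE STATUS`). It only works out which logs are older than that one and builds the `PURGE BINARY LOGS TO` statement; actually running the statement is left to you.

```python
import re

# Bin-log names look like "mysql-bin.000012": a base name plus a numeric suffix.
BINLOG_NAME = re.compile(r"^(?P<base>[\w.-]+)\.(?P<seq>\d+)$")


def binlog_sequence(name):
    """Return the numeric suffix of a bin-log file name."""
    match = BINLOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not a bin-log file name: {name!r}")
    return int(match.group("seq"))


def purgeable_binlogs(binlogs, slave_current_log):
    """List the logs strictly older than the one the slave is reading."""
    cutoff = binlog_sequence(slave_current_log)
    older = [name for name in binlogs if binlog_sequence(name) < cutoff]
    return sorted(older, key=binlog_sequence)


def purge_statement(slave_current_log):
    """Build the SQL that removes every log before the slave's current one."""
    binlog_sequence(slave_current_log)  # reject anything that is not a log name
    return f"PURGE BINARY LOGS TO '{slave_current_log}';"


# Made-up sample values.
logs = ["mysql-bin.000003", "mysql-bin.000001", "mysql-bin.000002"]
print(purgeable_binlogs(logs, "mysql-bin.000003"))
print(purge_statement("mysql-bin.000003"))
```

With more than one slave, the cutoff would have to be the oldest log that any of them still needs.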
Hope that helps someone!