Back to normal

November 23, 2007

Whatever that is 😉

Unhappy Thanksgiving :( – site is down

November 22, 2007

The Apache server is apparently frozen (it responds to pings but not much else).  There is a remote reboot tool to give the server a kick, but it seems that in the switch to the four-server setup, the reboot panel didn’t get updated, so it’s not connected to any of our servers anymore.

I’ve sent a message to our host and hopefully it will get rebooted soon.

Sorry 😦

Weirdness fixed!

November 16, 2007

Long story short: everything was backwards but it is fixed now.

For the curious, here are the technical details:

Although it’s undocumented, MediaWiki doesn’t just use your normal one-line database settings for the master server… if you have multiple servers set up, it treats the server at index 0 as the master.  Our “slave” was at index 0 with 99% of the read load until yesterday.  Then it was still at index 0, but with only 70% of the read load.

The effect was that the powerful server just kind of sat there with stale data, serving 1% of reads until yesterday, when it started serving 30% of the reads and showing its ugly out-of-date data to the world.
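For anyone who wants to see the mix-up spelled out, here is a minimal Python sketch of the behavior described above. It is not MediaWiki's actual code (MediaWiki is written in PHP); the server names and the `master`, `pick_reader` and `read_share` helpers are made up for illustration. It assumes only that index 0 is treated as the master and that reads are spread in proportion to each server's load value.

```python
import random

# Hypothetical server list, in the order the wiki software would see it.
# Index 0 is treated as the master no matter what we intended.
servers_before_fix = [
    {"name": "small-box (meant to be the slave)", "load": 70},
    {"name": "big-box (meant to be the master)", "load": 30},
]


def master(servers):
    """Writes go to whatever sits at index 0."""
    return servers[0]


def pick_reader(servers, rng=random):
    """Pick a server for one read, weighted by its 'load' value."""
    weights = [s["load"] for s in servers]
    return rng.choices(servers, weights=weights, k=1)[0]


def read_share(servers):
    """Fraction of reads each server should expect to receive."""
    total = sum(s["load"] for s in servers)
    return {s["name"]: s["load"] / total for s in servers}


if __name__ == "__main__":
    print("writes go to:", master(servers_before_fix)["name"])
    print("read share:", read_share(servers_before_fix))
```

With the list in this order, every write lands on the small box while the big box only ever gets reads, which is the "backwards" situation. Swapping the two entries (and copying the fresh data over first) is the shape of the fix.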

I copied the data from the “slave” to the master and changed the configuration so that it really is the master again.  Most of the data (except for a few minutes of changes right before the switch that were mostly by myself and another admin) should still be intact.

Whatamess that was.  Thanks to all of the admins and contributors on LyricWiki for pointing these problems out to me, and thanks to TimStarling of Wikimedia and domas of MySQL for their help figuring out what was wrong and helping me fix it.  You guys are ninjas.

Good night.

Just plain weird… site temporarily read-only

November 16, 2007

The site has been behaving really weirdly today.  Some pages are randomly showing content from two weeks ago…

I’m talking to some of the guys from Wikipedia (they’re always helpful) and trying to figure it out.

I reeeeeeeeeally hate having the site read-only, so I’ll unlock it as soon as I can get everything working.

Growing pains FTL. 😦

API back up to speed

November 15, 2007

If you’ve been using the API over the past week, you probably noticed the painfully large percentage of results that were being returned as “Not found”.

The recent large increase in traffic was causing us to get “Too many connections” errors when the API was left alone, so I had to turn on a throttling system which would randomly drop a certain percentage of the requests.  Looking into our server logs, I found out that our actual web server (behind our Squid caching server which serves up 30% of our pages) has been getting over 1 million page requests per day!  Wow… that explains the scaling problems.
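The throttling idea is simple enough to sketch in a few lines of Python. This is only an illustration, not the code the API actually runs: `should_drop`, `handle_request` and the `lookup` callback are hypothetical names, and returning “Not found” for a dropped request is an assumption based on what API users were seeing.

```python
import random


def should_drop(drop_fraction, rng=random):
    """Return True if this request should be shed.

    drop_fraction is the share of requests to drop: 0.0 drops none, 1.0 drops all.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0.0 and 1.0")
    # rng.random() is uniform on [0.0, 1.0), so this is True about
    # drop_fraction of the time.
    return rng.random() < drop_fraction


def handle_request(query, drop_fraction, lookup, rng=random):
    """Answer a request, or shed it before it ever touches the database."""
    if should_drop(drop_fraction, rng):
        return "Not found"  # assumed response for a dropped request
    return lookup(query)
```

The point of deciding before the lookup is that a dropped request never opens a database connection, which is what keeps the server under its connection limit. The obvious downside is that callers can't tell a dropped request from a song that really isn't there.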

I was overly busy for most of the week (a drawback of having LyricWiki not be my “day-job”), so I first got to really attack the problem tonight.  It appears that everything is back up to speed, and the throttling is turned off.  I’ll be keeping an eye on how the site is doing tomorrow during peak traffic time, but I think we should be okay.

I have some more fixes planned for the near-future which should make it so the API can continue to handle increasing traffic.  I probably won’t post about them as they happen, but hopefully you’ll notice an increase in the speed that results are served up.

Spotty outages… very confused.

November 7, 2007

Yesterday I got the replicated slave database up and running and even made the API use MediaWiki’s built-in database connections, which are persistent, so that should have cut down the number of connects/disconnects (which are time-expensive).
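To show why reusing a connection helps, here is a small Python sketch of the general idea. It is not how MediaWiki's persistent connections are implemented; `ReusedConnection` and `connect_to_db` are hypothetical names, and an in-memory SQLite database stands in for the real one just to keep the example self-contained.

```python
import sqlite3


class ReusedConnection:
    """Open the database connection once and hand back the same one afterwards."""

    def __init__(self, connect):
        self._connect = connect  # function that opens a brand-new connection
        self._conn = None
        self.opens = 0           # how many real connects have happened

    def get(self):
        # Only pay the connect cost the first time we are asked.
        if self._conn is None:
            self._conn = self._connect()
            self.opens += 1
        return self._conn


def connect_to_db():
    # Stand-in for the real database server.
    return sqlite3.connect(":memory:")


if __name__ == "__main__":
    db = ReusedConnection(connect_to_db)
    for _ in range(5):
        db.get().execute("SELECT 1").fetchone()
    print("queries: 5, connects:", db.opens)
```

The trade-off is that a reused connection stays open between queries, so it keeps occupying one of the server's connection slots even while it is idle.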

Today, we’ve still been getting “Too Many Connections” errors… possibly because the MediaWiki persistent connections don’t close very quickly?  We’ll be looking into this some more… maybe I need new stats on how much traffic the API is getting.

Anyway, the solution I’ve taken is that during peak times, I keep setting the API to drop a certain percent of requests.  This isn’t a cool solution, so I’ll be trying to figure out a better way… anyone have any ideas?

Tue 11/6/07 Outages

November 6, 2007

Our API is getting hammered… again.  Part of this is due to a new site that’s using our API:  They’re an interesting site where people can listen to internet radio stations while live-chatting at the same time (presumably conversation revolves at least partially around the music).

I tried some temporary fixes, but I don’t know who I’m kidding: we need to have a read-only database replica for the API to be able to survive.  I should have thought of it earlier, but this week really started pressing it.

I wanted to do this all in the late hours of the night, but we can’t wait any more: the site is getting rocked…
I’ll post when it’s fixed.