March 16, 2008
Everything is finally patched up. The site and the API are both up and running, and seem to be handling traffic as well as ever. Thanks to everyone who helped as we worked through the problems.
On top of that, thank you to everyone for being patient while the API was down (it ended up being more than 3 straight days). This problem gave me a lot of time to think about how to avoid a similar problem in the future, so hopefully we won’t be going down this path again any time soon!
November 15, 2007
If you’ve been using the API over the past week, you probably noticed the painfully large percentage of results that were being returned as “Not found”.
The recent large increase in traffic was causing “Too many connections” errors when the API was left alone, so I had to turn on a throttling system that randomly drops a certain percentage of the requests. Looking into our server logs, I found that our actual web server (behind our Squid caching server, which serves up 30% of our pages) has been getting over 1 million page requests per day! Wow… that explains the scaling problems.
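For the curious, the random-drop idea looks roughly like the sketch below. This is an illustration in Python, not the actual LyricWiki code: the names `should_drop`, `handle_request`, and `drop_fraction` are made up for the example, and it assumes a dropped request is simply answered with "Not found" instead of touching the database.

```python
import random


def should_drop(drop_fraction, rng=random):
    """Return True if this request should be shed.

    drop_fraction is the share of requests to reject, from 0.0 (drop
    nothing) to 1.0 (drop everything). rng.random() returns a float in
    [0.0, 1.0), so the comparison is True about drop_fraction of the time.
    """
    return rng.random() < drop_fraction


def handle_request(lookup, drop_fraction, rng=random):
    """Serve one request, or shed it before doing any expensive work.

    lookup is a hypothetical zero-argument callable standing in for the
    real database query. Shed requests get "Not found" (an assumption
    about how the throttle responded) and never open a connection.
    """
    if should_drop(drop_fraction, rng):
        return "Not found"
    return lookup()
```

The appeal of this approach is that it needs no bookkeeping at all: there are no per-client counters, just one random number per request. The obvious drawback is that legitimate requests get dropped as often as any others. For a sense of scale, 1 million requests per day averages out to roughly 11.6 requests per second, with peak hours well above that.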
I was overly busy for most of the week (a drawback of having LyricWiki not be my “day-job”), so tonight was the first chance I had to really attack the problem. It appears that everything is back up to speed, and the throttling is turned off. I’ll be keeping an eye on how the site is doing tomorrow during peak traffic time, but I think we should be okay.
I have some more fixes planned for the near future which should let the API continue to handle increasing traffic. I probably won’t post about them as they happen, but hopefully you’ll notice an increase in the speed at which results are served up.