Cligs Stability: History and Fix
Monday, March 23rd, 2009
As regulars will have noticed, Cligs hasn’t been doing well on uptime lately. Firstly, big apologies for that. Cligs has been growing faster than I was able to deal with, and I could have pre-empted this kind of growth given the hints that were coming in back in January.
Secondly, and this is what I want to talk about, the source of the instability has been identified and, I believe, fixed. Although the fix has been uploaded and is live and working, I need to do a bit of a clean-up before the server is completely stable.
The clean-up will take a few hours, probably overnight, and while it’s running Cligs might be a bit slow. Once it’s done, I think we’ll be all set. I’ll be keeping an eye on this of course, and will intervene if something doesn’t look right.
What follows now is very geeky, so please feel free to stop reading.
What’s the source? Cligs runs on Ubuntu, which is based on Debian Linux. Debian sets very strict security permissions on the directory where PHP stores its session files; by default that’s /var/lib/php5. The side-effect of these permissions is that PHP itself can’t clean stale sessions out of the directory, so the Debian solution is to run a "garbage collector" cron job that does the cleanup once a day.
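For the curious, here is roughly what a stale-session sweep does, as a minimal Python sketch. This is not Debian’s actual cron script; the function name `remove_stale_sessions` and its parameters are made up for illustration, and it assumes a file is "stale" when its modification time is older than a chosen maximum lifetime (1440 seconds is PHP’s default `session.gc_maxlifetime`).

```python
import os
import time


def remove_stale_sessions(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` whose mtime is older than
    `max_age_seconds`. Returns the number of files removed.

    Hypothetical helper for illustration only.
    """
    now = time.time() if now is None else now
    removed = 0
    # os.scandir yields entries lazily instead of building one big list,
    # which helps when the directory holds a very large number of files.
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            try:
                age = now - entry.stat(follow_symlinks=False).st_mtime
                if age > max_age_seconds:
                    os.unlink(entry.path)
                    removed += 1
            except FileNotFoundError:
                # Another process removed the file first; nothing to do.
                continue
    return removed
```

Something like `remove_stale_sessions("/var/lib/php5", 1440)` would need enough privileges to read and delete in that directory, which is exactly why Debian does the job from cron and not from PHP.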
All fine and good until you get a very popular service that generates session files faster than the garbage collector can deal with them. This triggers a cascade in which the cleanup takes longer and longer chasing down more and more session files. It takes up significant memory and uses up the processor: on Cligs, one of the 4 CPUs on the server has been constantly in use by this cleanup process!
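To see why this runs away, here is a toy model in Python. The numbers are invented and are not measurements from Cligs; `backlog_by_day` is a hypothetical helper that simply assumes a fixed number of session files created per day and a fixed number the collector manages to remove per day.

```python
def backlog_by_day(created_per_day, cleaned_per_day, days, start=0):
    """Return the number of session files left over at the end of each day.

    Toy model: every day `created_per_day` files appear and the collector
    removes at most `cleaned_per_day` of them.
    """
    backlog = start
    history = []
    for _ in range(days):
        # The backlog can never go below zero: there is nothing left to delete.
        backlog = max(0, backlog + created_per_day - cleaned_per_day)
        history.append(backlog)
    return history
```

With made-up figures, `backlog_by_day(1000, 800, 5)` gives `[200, 400, 600, 800, 1000]`: as long as creation outpaces cleanup, the pile only grows. In practice it is worse than this linear picture, because the cleanup itself slows down as the directory fills up.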
This runaway scenario has another side-effect: the more files that accumulate in the directory, the longer it takes PHP to find the right session file. This slows down access significantly (seen as latency doubling from ~200ms to 400+ms) and slows down Apache. The effect cascades further: each Apache process needs more memory and more time to do its work, which uses up more resources, which squeezes other processes that want more resources themselves, and you know how this ends.
Which brings us to today’s fix: I’ve changed the way the core of Cligs works to generate fewer session files - don’t ask, we live and learn and think about things differently.
But that still leaves the directory with lots of session files. That’s what the cleanup is doing: clearing out the stale session files to get the server back to a clean start.
Keen readers will note that I’ve talked about this exact problem before. Back then I described a full fix, which worked until recently. That’s how it goes: you live to go down another day.

Why is this useful? Imagine this scenario: you find a page you’d like to tweet about, so you click the Cligs bookmarklet, and then click the Twitter link. With this new addition, you simply edit the text of the tweet instead of copying and pasting or composing the text from scratch.
So what does such a service look like? Let’s take a look at