when it rains, it pours

Date: 2007-02-16 23:30:00

I have had a bad week. More specifically, my hosted server (which is my web server, mail server, dns server, database server, subversion server, etc.) has had a bad week. It all started when I tried to upgrade clamav.

The other day clamav was complaining that there was a new version and I ought to upgrade. No problem, I'll just go do that like I have for the last half dozen program updates. Download, configure, make, make install, done. The next morning, I noticed that freshclam was no longer getting updates to the virus definition files (I have automatic monitoring that looks for this sort of thing). So, I check into it and it's complaining that it doesn't have libgmp built in. Sure enough, the configure script doesn't seem to be able to find my libgmp by default so I hack it into the configure script and rebuild it: make, make install, done. Leave it alone.
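The kind of monitoring mentioned above can be sketched roughly as follows. This is not my actual monitoring script, just a minimal illustration that assumes "no updates" can be detected from a file's modification time. The function names, the 24-hour threshold, and whatever paths get passed in are all hypothetical.

```python
import os
import time


def is_stale(path, max_age_hours=24.0, now=None):
    """Return True if path is missing or was last modified too long ago."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as stale (an assumption).
        return True
    return (now - mtime) > max_age_hours * 3600


def stale_files(paths, max_age_hours=24.0, now=None):
    """Return the subset of paths that look stale."""
    return [p for p in paths if is_stale(p, max_age_hours, now)]
```

Run from cron against the virus definition files, a non-empty result from stale_files() would be the cue to send an alert.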

Eight hours later (this has been a super busy week at work, too), I notice that I've received no new email. Sure enough, my mail server has eaten every single incoming email for the past eight hours because clamscan is erroring out on startup for some reason. Reinstall old version, mail works again. Add "figure out clamav" to todo list. Pore through the mail logs to look for "From" addresses of people who might have sent non-spam email. From about 200 connections, there were about six legitimate messages. Email a humble apology and request to resend to each of them (Amy was working on business stuff yesterday and needed those messages).

The next morning, I wake up and check my email - hundreds of "out of disk space" messages from my mail server. Try to ssh in, no response. Looks like it's used up all the available space and wedged itself (yes, I shouldn't have everything in one big partition - but it was preconfigured by the hosting provider that way). I call up support and tell them the name of a file they should be able to delete to free up some space. This involved them rebooting the server, entering single user mode, and nuking files as root. Kind of scary for me. Anyway, apparently that file wasn't big enough or something and support emails me back saying it didn't work and perhaps I could think of something else that's bigger. Sure, I reply, there's a couple of files in this other directory that are pretty big. Wait for response. Keep waiting. Start getting frustrated and impatient. Amy jokes that maybe the support guy went to lunch (it was noon in Texas). 45 minutes pass, no action. Call them back (at 25c/min), find out that the support guy really did go to lunch, no joke, impress upon the other guy that I really would like my server back up. Half an hour later, no action. I have to go to work at this point, but soon after I get to work it comes back up. Remember those files I told him about in email that he could nuke? He nuked the entire directory that contained those files. Not exactly what I had in mind, but fortunately it wasn't anything important.

Ok, server is back up and running for today. I still haven't figured out why it ran out of space. After coming home from work and sleuthing around, I find it - postgresql has been having trouble archiving its log files to amazon s3 (I've got custom glue that makes this work, or at least is supposed to) for the past three weeks. There are a hundred or so 16 MB files in the pg_xlog directory. After more detective work, I find that this stopped working three weeks ago because there had been a power problem at the hosting center and my server was rebooted. When postgresql restarted, it came up with a different PATH and environment than when I had set up the amazon s3 archiving, and the archiving was broken. Not just for one reason, but for two separate reasons. That made for great fun: I fixed the first reason but didn't think the fix had worked (because the failure mode was the same as before), so I backed out that change and was at a loss for more ideas. Finally, after adding more diagnostics, I figured out the real reason and made the log backup work again.
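A check like the following would probably have caught the pile-up weeks earlier. It is only a sketch: it assumes log segments are files whose names are 24 hexadecimal characters and that each is 16 MB, as described above. The function names and the threshold of 20 segments are made up for illustration, and since a healthy server keeps some segments around anyway, the right threshold depends on the configuration.

```python
import os
import re

# Segment file names are 24 hexadecimal characters.
WAL_SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")
SEGMENT_BYTES = 16 * 1024 * 1024  # 16 MB per segment, as observed above


def wal_backlog(xlog_dir):
    """Return (segment_count, approx_bytes) for log segments in xlog_dir."""
    count = sum(
        1
        for name in os.listdir(xlog_dir)
        if WAL_SEGMENT_RE.match(name)
        and os.path.isfile(os.path.join(xlog_dir, name))
    )
    return count, count * SEGMENT_BYTES


def backlog_warning(xlog_dir, max_segments=20):
    """Return a warning string if the backlog exceeds max_segments, else None."""
    count, size = wal_backlog(xlog_dir)
    if count > max_segments:
        return "log backlog: %d segments (~%d MB) in %s" % (
            count, size // (1024 * 1024), xlog_dir)
    return None
```

Pointing backlog_warning() at the pg_xlog directory from a periodic job and mailing any non-None result would flag a stalled archiver long before the disk fills.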

One point to note here is that in my diagnostic activities, I temporarily broke the log file archiving so a bunch of log files were not actually backed up, but were just deleted. No matter, I'll just run a new full postgresql backup. I needed to do that anyway.

While I'm running the postgresql full backup, I notice that new shells are taking an unusually long time to start. Looking in /var/log/messages shows a ton of scary messages like "Feb 16 08:34:50 occam kernel: ad0: FAILURE - READ_DMA status=51<READY,DSC,ERROR> error=40<UNCORRECTABLE> LBA=125165407". Great, hardware failure. Why now? Why in the middle of my database backup? Now the latest database backup I have is three weeks old.

My server hardware is owned by the hosting provider, so if it really does die, they'll fix me up with a new box or something. But they provide no backup services; the data is all my problem.

Now it's midnight and I'm in the process of salvaging everything I can off the server in the expectation that it's going to fall over and die any minute now. I've had various backups in place but you know there's always the feeling in the back of your mind that it's perhaps not enough? Did I back up that configuration file? What about the local modifications I had to make to that other program to make it work with the frombozulator? My hosted server has been running for two years and I've done a ton of work on making it work just the way I want. It serves something like 20 domains (about 7 of which are actually important) and email for Amy and me.

The annual renewal for my hosted server is up next week or so and I'm considering throwing in the towel and just getting a managed shared dreamhost account for a tenth the price. Since dreamhost doesn't appear to offer postgresql, I might have to do unpalatable things like port minilink.org to mysql. And maybe even to something other than python.

The really weird thing is that the three major failures (clamav eating email; postgresql eating disk space; hardware eating itself) appear to be unrelated. In each case, I resolved the problem before the next problem started. But why did they all happen in sequence? Why is the universe doing this to me?

Comment

gurzil
2007-02-16T13:32:13Z

Ack, fun.

I guess it is these times that you remind yourself why it was worth it in the first place. When you know how to do it yourself, any hosting provider is just going to piss you off by doing it wrong.

BTW, regarding clamav, I am almost considering ditching it. Greylisting takes out such a large chunk of spam for me, I'm not sure another 2-3% matters enough. I disabled the RBLs I was using, and have not had any increase. I get around one spam a day that actually gets through anymore (on 5 domains).

You might consider going to a virtual hosted machine. Less hardware worries, and so far the performance has been fine (although I am also stuck with one partition).

leroy_brown242
2007-02-16T15:05:40Z

This makes me have managed.com flashbacks.

Bad thing, after bad thing, after inept help, after hardware failure, after swear words... you get the picture.

Good luck.

cetan
2007-02-16T15:40:01Z

Man, that really sucks. Good luck with getting things back up to some semblance of "order."

pasketti
2007-02-16T16:26:36Z

"Why is the universe doing this to me?"

Because you are so much fun to pick on.

victriviaqueen
2007-02-16T18:53:21Z

Sorry... I can't imagine the suckage of being thousands of miles and multiple timezones away from my server. One timezone is enough (most of our stuff is on a server in Alberta). But, if anyone can work through the nonsense, it's you.

decibel45
2007-02-16T19:30:12Z

"Since dreamhost doesn't appear to offer postgresql"

If you get a shell account it wouldn't be hard to make PostgreSQL happen...

Greg Hewgill <greg@hewgill.com>