The Fall and Rise of a Server

Some of you may have noticed an extended outage of the server that serves this blog, Hazel Dell Christian Church’s web site (www.hdchristian.org) and some of the ancillary sites that I support.  This outage was caused by a string of problems the likes of which I’ve never seen at any client – but I’m really glad it hit me rather than a client.

Clean Living

I feel compelled to tell you that the server – RENEGADE – has had a clean life.  By that I mean it’s always been in a data center or colocation facility.  As a result it got clean, steady power and a temperature- and humidity-controlled environment.  It’s not one of those servers shoved under my desk, hidden in a closet under a ficus plant, or subjected to any of the other strange things clients do to servers.  Despite its clean living it had a series of issues that started May 4th.

Wake Up Call

On Friday morning I woke up away from my office in Danbury, CT.  I noticed that I hadn’t gotten any mail since about 12:45 AM.  Looking down I saw that Exchange was disconnected.  When I investigated further I found that the whole server was offline.  One call to my friends at BlueLock and the server was back up.  I saw that it was a 0x00000077 bug check that brought the server down.  I didn’t bother checking into it much because I was on a client site.

Echoes of Nightmares

I can’t think of much that is more disturbing than a server going back down 20 minutes after I brought it up.  That’s what happened to me.  Another break from the client and another call had the guys unracking the server and arranging to transport it to my house – I can’t tell you how grateful I am for the guys at BlueLock.  I’m not really even on their platform, but because of our long friendship they were bending over backwards to make sure that I was taken care of.

But How Late is Late?

So I flew in later that night and set about figuring out what was wrong.  The initial analysis was strange.  The bug check is most likely caused by a disk read error while reading the paging file.  That would be odd in and of itself, but even odder when you consider that the server has mirrored drives – a mirror that reported healthy when I looked at it in Disk Manager.  Still, new disks are relatively cheap and it’s an excuse to upgrade storage.  Alas, Fry’s isn’t open 24 hours a day yet, so at 3 AM I go to bed and plan to get drives in the morning.

Stupid is as Stupid Does

Here is where I made the critical mistake.  I had an opportunity to start a backup before I went to bed – or before I left for Fry’s in the morning.  I didn’t.  Frankly I didn’t expect any problems.  It would have been a *REALLY* good idea to have taken a backup at this point.

Broken Mirrors

So I get back, install one of the new drives, remove one of the old drives and try to start a mirror … and it fails.  A few more attempts and I break out Ghost 2003.  I try to get Ghost to image the drive across – and get an internal error and decide to give up.  I’ve not had any great luck with Ghost and I had considered trying the upgrade – but when Symantec’s web site told me I couldn’t upgrade I felt like I was not going to get anything from it.  (Certainly didn’t feel like they valued existing customers.)

So I placed the other drive into the system and tried to use it to make a copy to one of the new drives.  That was fun – except for the fact that the drive wouldn’t boot – and after trying to recover the MBR the partition table was unreadable.  It turns out it doesn’t matter anyway because the drive wouldn’t stay operational for more than a few minutes at a time.  (I’m really confused how the mirror reported healthy at this point.)

Step Back and Punt

I have a backup – it’s a week old, but frankly there’s not so much going on in my system that losing a week is a big problem.  (Remember, as I said above, I should have taken another backup.)  So I install the server again with a new name and build the OS.

This seemingly takes an eternity.  I go to restore the system state – a part of my backup – and I find to my dismay that I don’t have a system state backup.  Sure, I placed it in the selection list, but despite my best efforts I can’t get the stupid thing out of the backup file.  Of course, that means I have to reload a bunch of software – but I can manage.

Chasing My Tail

I rebuilt the system and got it operational with drivers and such.  I go to start putting back services and start running into problems.  I spend several (OK, more than that) hours trying to get Exchange to accept the backups and it’s not working.  I place a call to Microsoft product support to get some help, but the technician I got was unable to understand the case # I gave him, wasted 30 minutes trying to talk to a technical router, and basically got me so frustrated with him and the team’s service that I decided that I’d deal with it another time.

I did manage to get the server accessible from the outside via terminal services – which is important because I was leaving on Sunday evening to fly to Anaheim to the Advisor Summit on Microsoft SharePoint.  That meant leaving the server down including web sites, mail, etc.  Not something I was looking forward to.

Google isn’t Completely Evil

Before leaving I did one other thing: I set up Google for Domains to take care of my mail – including my wife’s mail, my assistant’s, etc.  I tested it and it was working.  However, rather than setting up DNS like they wanted, I took my existing Anti-Spam/Anti-Virus vendor and pointed their output to Google instead of my server – I figured that would minimize the DNS instability time when I came back online.

You’ve never had fun until…

While in Anaheim, between my presentations, I found enough time to start working on the server some more.  I ended up calling back in to Microsoft for a three hour phone call to fix Exchange.  It seems that the online backup I did was corrupt.  We ended up doing a soft recovery on the off-line backup I took as well (while Exchange was running).  I managed my own way through a set of issues with the SQL 2005 installer, which requires precise folder names if you’re installing from a drive rather than CD media.  (Remember I’m miles away …)  The net effect is that by the time I make it home I’ve got most of the operations working on the server and I should literally not need to do much more than plug it in.

I decide not to move mail over to the server because I’m on a home cable modem and frankly I think every home cable modem is blacklisted for mail.  It’s easier just to wait until the next day to plug things in since the critical thing – email – had a workaround.

Or so I thought…

Well, that’s what I thought until my assistant called and told me she couldn’t log in to Google.  Despite Google continuing to receive mail for the domain, they had shut off email for users – without notifying me – because they perceived that I didn’t have email configured correctly.  I managed to “bump” Google and convince it to reinstate the email accounts – thanks to the free WiFi at Phoenix Sky Harbor airport.

Back in the Saddle

The next day I put the server back in the rack, test operations, and find everything working … except for SSL certificates, a few weird web site quirks, and the assorted other issues that I had dealt with over the years but needed to redo because I lost my system state.

What did I learn from this?

If you’re like most of the folks that I relate this story to, you’re saying “What are the odds?”, “How unlucky can you get?”, “You’re pulling my leg, right?”, etc.  My perspective is how I started out – I’m glad it happened to me, where the impact is containable, rather than to a client.  The lesson is a bit more complicated.  The short version is that multiple layers of redundancy WILL fail.  It’s just a matter of when.  Should all of these things have happened at the same time?  No.  Could it happen to you?  Yes.  So my question is, how many things have to go wrong before you lose too much data or have a failure you really can’t recover from?