Public deprived of WS site for two boring days

By Brian Livingston

Power users of Microsoft Windows found themselves with nothing to read but blogs when a disk crash took down the WindowsSecrets.com site Oct. 13–14, subjecting Web surfers to 48 hours of utter boredom.

Fortunately, all the site's information was soon back online, to the chagrin of some of our columnists, who'd hoped that a few poorly chosen sentences here and there would disappear forever.

Being the geeks that we are, we've crammed the Windows Secrets server with hardware designed to keep things running 24/7. The box is packed with four separate hard disks, which we imaginatively call Drives 0, 1, 2, and 3.

Because hard disks can crash, our server uses RAID technology. In a mirrored setup, RAID (as described by PCGuide.com) instantly switches from a failed hard drive to a second, identical drive. This is supposed to eliminate downtime.

A built-in RAID controller on our server's motherboard mirrors Drives 0 and 1, which contain our operating system and thousands of lines of code. An independent RAID add-in card synchronizes Drives 2 and 3, which contain our database.

At 12:10 a.m. Pacific Time on Oct. 13, Drive 3 experienced a head crash. Our RAID setup should have recovered smoothly from this. What we didn't know, however, was that Drive 2 had failed a few weeks earlier. The RAID controller for some reason neglected to notify us back then, when we could have installed a fresh drive. (Or perhaps the e-mail was routed to Microsoft, which outsourced the message and then lost all copies of it, as WS contributing editor Rob Vamosi reports in today's Top Story.)
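To make the failure sequence concrete, here is a minimal Python sketch of a two-drive mirror. It is a toy model, not a description of how our RAID card (or any real controller) works, and the class and method names (MirroredPair, fail_drive, status) are invented for illustration. It shows why a silent first failure matters: the mirror keeps serving data with one drive dead, so nothing looks wrong until the second drive goes.

```python
class MirroredPair:
    """Toy model of a two-drive mirror (RAID 1). All names are hypothetical."""

    def __init__(self, names):
        # Each drive is modeled as a dict of block number -> data.
        self.drives = {name: {} for name in names}
        self.failed = set()

    def healthy(self):
        """Return the names of drives that have not failed."""
        return [n for n in self.drives if n not in self.failed]

    def fail_drive(self, name):
        """Simulate a drive failure."""
        self.failed.add(name)

    def write(self, block, data):
        """Write the same data to every healthy drive."""
        alive = self.healthy()
        if not alive:
            raise IOError("array offline: no healthy drives")
        for name in alive:
            self.drives[name][block] = data

    def read(self, block):
        """Read from the first healthy drive."""
        for name in self.healthy():
            return self.drives[name][block]
        raise IOError("array offline: no healthy drives")

    def status(self):
        """Report 'optimal', 'degraded', or 'offline'."""
        alive = len(self.healthy())
        if alive == len(self.drives):
            return "optimal"
        return "degraded" if alive > 0 else "offline"


array = MirroredPair(["Drive 2", "Drive 3"])
array.write(0, "subscriber record")

array.fail_drive("Drive 2")      # first failure: data is still readable
first_status = array.status()
still_readable = array.read(0)

array.fail_drive("Drive 3")      # second failure: nothing is left
final_status = array.status()

print(first_status, still_readable, final_status)
```

The "degraded" state only helps if someone sees it. A monitoring job that polls something like status() and raises an alert on anything other than "optimal" is what buys the time to install a fresh drive.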

Lacking the expected responses from Drives 2 and 3, the on-board RAID controller went bonkers, gradually corrupting data sectors on Drives 0 and 1. We learned later that this particular controller behaves poorly in this specific situation. Now they tell me!

At this point, all four drives in our vaunted RAID array were rendered useless. The good news is that all of our information is fine and our server is fully restored.

Thankfully, we're a bit fanatical about backups here. Not only does our server make a nightly backup, which is stored deep beneath a mountain somewhere, but it also communicates in real time with a replication server that we keep far away from the Web server.

As it was programmed to do, our replication server had preserved every single transaction that had been committed to our database. That included a subscription by some lucky person just seconds before the 12:10 a.m. disk crash.

To get our server back to normal, all we had to do was swap in three spare drives (yes, we had them on hand), reinstall our operating system and code, and repopulate our database from the replication machine.
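The last step, repopulating the database from the replication machine, can be pictured as replaying a log of committed transactions in order. The Python sketch below is only an illustration under that assumption. The log format, the field names, and the replay function are hypothetical, and nothing here describes how our replication server actually stores its data.

```python
def replay(log):
    """Rebuild a table by applying committed transactions in sequence order."""
    table = {}
    for entry in sorted(log, key=lambda e: e["seq"]):
        if entry["op"] == "upsert":
            # Insert a new row or overwrite the existing one.
            table[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            # Remove the row if it exists; ignore it otherwise.
            table.pop(entry["key"], None)
    return table


# Hypothetical log kept by the replica; entries may arrive out of order.
replica_log = [
    {"seq": 2, "op": "upsert", "key": "bob@example.com", "value": "free"},
    {"seq": 1, "op": "upsert", "key": "alice@example.com", "value": "paid"},
    {"seq": 3, "op": "upsert", "key": "bob@example.com", "value": "paid"},
    {"seq": 4, "op": "delete", "key": "alice@example.com"},
    {"seq": 5, "op": "upsert", "key": "carol@example.com", "value": "paid"},
]

restored = replay(replica_log)
print(restored)
```

Because every committed transaction carries a sequence number, the rebuilt table ends up in the same state as the original, right up to the last entry logged before the crash.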

Believe me, all this takes more than 60 minutes. Several WS staffers worked day and night Oct. 13 and 14 to restore our server and bring you today's articles. We're ba-a-a-ck!

Being down for 48 hours was a living hell, but our disaster plan was never designed to guarantee 99.999% uptime. That's always been way too expensive. Instead, we're obsessed with never losing one byte of reader data.
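For a sense of scale, the arithmetic behind those uptime figures is easy to check. The short Python sketch below assumes a 365-day year and counts only this one 48-hour outage; the function names are made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_minutes(uptime_percent):
    """Minutes of downtime per year permitted by a given uptime target."""
    return HOURS_PER_YEAR * 60 * (1 - uptime_percent / 100)


def uptime_percent(downtime_hours):
    """Yearly uptime percentage implied by a given number of hours offline."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


five_nines_budget = allowed_downtime_minutes(99.999)
after_outage = uptime_percent(48)

print(f"99.999% uptime allows about {five_nines_budget:.2f} minutes offline per year")
print(f"A 48-hour outage leaves about {after_outage:.2f}% uptime for the year")
```

In other words, "five nines" leaves a budget of roughly five minutes a year, while two days offline works out to about 99.45% for the year. That gap is why guaranteeing the former costs so much.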

If you're a subscriber, you remain a subscriber. If your paid sub expires on Dec. 31, you're darn tootin' it still does. If you purchased a lifetime subscription ... well, we can't tell you the end of your lifetime, but we didn't know that before the crash, anyway.

I was seriously tempted to fire the individual responsible for the outage — me — but I decided to extend mercy to me. After all, if I don't forgive me for my lack of psychic abilities, who will?

This week's disk crash was unrelated to the electrical blaze on July 3–4 that knocked the hardened colocation facility we use in Seattle offline (which I reported on July 9). But outages such as these have made us more interested in moving to virtual servers (as described by ShareVM.com).

Virtual-server complexes, like Rackspace's Mosso and Amazon's Elastic Compute Cloud (EC2), are located in special data centers. If one machine goes down, or an entire data center loses power, identical servers in another location can instantly take over. The cost of such services has plummeted in recent years.

Well, if virtual servers are so great, why is Windows Secrets still hosted on a single server that can go down at any time?

The answer is that virtual servers present unique reliability and security issues, as Rob outlines today. It's true that all Web servers are "in the cloud," in the sense that they are "on the Internet." But cloud computing is a different animal, and it deserves to be done right.

I can assure you that, if Windows Secrets moves to virtual servers, they'll be fast and they'll be secure. Stay tuned in the months to come, and I'll keep you informed about our efforts to achieve true 100% uptime.

Brian Livingston is editorial director of WindowsSecrets.com and co-author of Windows Vista Secrets and 10 other books.
