Even Google Has the Outage Blues

As has been widely noted, Google’s popular Gmail service suffered a multi-day outage. Google says that they were rolling out a storage software update on February 27, 2011, and ran into a bug that wiped out the e-mail of 0.02% of their customers. Fortunately, the data was backed up on tape (yes, that’s right, The Cloud is made up partly of Magnetic Tape!).

This is just the kind of event that every tech person dreads. I know that firsthand. Last month, one of our company’s hosted servers went down and did not come back up.

There is nothing quite like the first few minutes after you have discovered a massive failure. Except for the minutes after that, and the hours after that. Nothing tests a contingency plan like a real-life meltdown.

Our (former) hosting provider had sent plenty of advance notice of a pending power distribution system upgrade that would require our box to be shut down. Re-reading those e-mails after the fact, they seem prescient. “We are notifying you in advance to allow sufficient time to prepare.” Blah blah blah.

The box was an old one, having run for over 9 wonderful years with the same hard drive (and no RAID configuration). There’s a word for that: stupid. But we had backups of all the customer data on the box, of all of our software, and of critical configuration files. The one thing not properly planned for was the evolution of various libraries over the years. As a result, some circa 2000 C++ code didn’t want to compile properly on the modern Linux distribution running on the replacement machine. The original compiler had been far more forgiving.

New technology to the rescue.

Thanks to virtualization, we ran some of the cranky old code on an old Linux distribution in a VM on a MacBook Pro laptop while we cleaned up. Thanks, VMware!

And our new configuration is very different, with thoroughly modern protections and backups.

Now all we have to do is update that contingency plan …

