By Henry Blodget
In addition to taking down the sites of dozens of high-profile companies for hours (and, in some cases, days), Amazon’s huge EC2 cloud services crash permanently destroyed some data.
The data loss was apparently small relative to the total data stored, but anyone who runs a web site can immediately understand how terrifying a prospect any data loss is.
(And a small loss on a percentage basis for Amazon, obviously, could be catastrophic for some companies.)
Amazon has yet to fully explain what happened when its mission-critical and supposedly bomb-proof systems crashed, but the explanation will be important. As will the explanation for how the company could have permanently destroyed some of its customers’ data.
In our experience, the “back-up” systems of most web-services providers leave a lot to be desired. The back-ups sound reassuring in theory–you are assured that your data is always “backed up” on a system that is completely separate from the main one and that you’ll be able to access it whenever you need it. But then, when you dig, you often discover that “backed up” means the data is simply copied to another file on the same box, or to another box in the same data room.
A stronger “back-up,” obviously, would be housed in a separate location, so that a power failure, flood, earthquake, or other disruption at the main site would not disrupt the back-up. Or, better yet, the back-up would be automatically replicated at multiple sites, all independent of one another, in near real-time.
And, of course, this is the sort of reliability that Amazon has been selling with its cloud services–including 99.9% uptime. Both promises, safe data and near-constant uptime, seem to have been broken here.
Here’s an email Amazon sent to a big customer letting them know that some of their data was gone for good. You’d think that, under the circumstances, Amazon could do a bit better than an impersonal “hello.”
A few days ago we sent you an email letting you know that we were working on recovering an inconsistent data snapshot of one or more of your Amazon EBS volumes. We are very sorry, but ultimately our efforts to manually recover your volume were unsuccessful. The hardware failed in such a way that we could not forensically restore the data.
What we were able to recover has been made available via a snapshot, although the data is in such a state that it may have little to no utility…
If you have no need for this snapshot, please delete it to avoid incurring storage charges.
We apologize for this volume loss and any impact to your business.
Amazon Web Services, EBS Support
This message was produced and distributed by Amazon Web Services LLC, 410 Terry Avenue North, Seattle, Washington 98109-5210
And here’s how one Amazon customer, Chartbeat, passed on the news of some lost data to its users:
Last week, Amazon experienced a massive service outage that affected many companies, including chartbeat. As a result, some chartbeat clients were temporarily unable to log in to their dashboards and may have seen gaps in their historical data.
All issues have since been resolved and the historical data is back in the visual timeline. Approximately 11 hours of historical data wasn’t recoverable and will appear as small gaps in the timeline. Our development team is also hard at work to limit the impact of any future AWS interruptions.
We sincerely apologize for any inconvenience you may have experienced and can be reached at firstname.lastname@example.org to answer any questions you have about the outage or about chartbeat.
Many days after the crash, Amazon still hasn’t gotten its systems fully up and running again. A glance at the live “status” page for AWS still shows some red.