Tidbits: Amazon Offers Explanation For S3 Outage

Saturday, February 16, 2008

In a highly unusual occurrence for Amazon, the company’s S3 storage service had an outage yesterday, leaving many people scratching their heads. Late yesterday evening, the company decided to take the open path and let the public know what happened behind the scenes.

Nicholas Carr of Rough Type passed along Amazon’s explanation of what caused the problem. It seems that at approximately 3:30 AM PST on the 15th, a user started sending a high volume of authentication requests to one of the S3 data centers. Amazon saw this happening but did not add capacity at that time. When another customer slammed the service with requests less than half an hour later, the downfall of S3 commenced. The authentication service was pushed past its maximum capacity, and it took Amazon time to move more capacity into the right areas. In effect, the service suffered the equivalent of a denial-of-service attack.

Whether it was intentional, we don’t know. Considering the size of some S3 customers, one has to wonder how two customers could have suddenly needed to send that many requests to the authentication server. (According to a supposedly leaked internal email that Mr. Carr shows on his blog, at least one Amazon executive wonders the same thing.)

As the S3 storage service is used by services as large as Twitter, this could have had some fairly serious ramifications if it had happened in the middle of the day.

While the questions are sure to linger for some time, Amazon is already taking steps to ensure that something like this cannot happen again. The company plans to increase capacity, build a service health dashboard for customers, and add more defenses around the authentication systems.

It is refreshing to see a company being so forthright with its explanations and pointing out its own systemic failings. Still, what happened here should have been avoidable. Considering how Amazon touts the service as highly scalable with superb uptime, it will have to make sure something like this doesn’t happen again.
