Amazon’s Unknown Unknowns

Operating the world’s largest public cloud still leaves you with the problem of “unknown unknowns.”
The phrase is associated with the then-defense secretary Donald Rumsfeld, who in 2002 described the problems of certainty with regard to war. The reality is no less acute for Amazon as it searches for ways to avoid another widespread outage of its public cloud business, Amazon Web Services.
“We ask ourselves the question all the time,” said Adam Selipsky, vice president at A.W.S. “There is a list of things you don’t know about.”
On Dec. 29, Amazon Web Services published a detailed explanation of what went wrong Christmas Eve, when Netflix and many other customers (Amazon won’t say how many) had service disruptions. The company also described what it was doing to ensure the problem would not happen again.
The short version: An A.W.S. developer inadvertently took out part of the software that makes sure data loads are properly distributed across the computers in a data center in Virginia. New activity on the site was slowed or failed altogether. Netflix was unable to supply streaming movies. Service was not fully restored for almost 12 hours.
What Amazon didn’t go into is how much it is trying to figure out what else might go wrong in an increasingly complex system. Part of that effort consists of publishing explanations (this one was notably full of information), and part consists, as in warfare, of lots more scenario planning.
A.W.S. tries to figure out how its complex global network of perhaps a half-million servers could break by having one team look at the work of another, by bringing in top engineers from elsewhere in Amazon and, occasionally, by hiring outside experts in performance and security.
“Running I.T. infrastructure in a highly reliable and cost-effective fashion is hard,” Mr. Selipsky said. “We’re able to put more resources on that than the vast majority of our customers can. That said, there is no substitute for experience.”
There is also a deep technical lesson in the recent outage. Engineers mostly look at code for flaws, but the operational breakdown at A.W.S. on Dec. 24 seems to have been a result of human, organizational error, not the software itself.
In the future, developers will need specific permission to change the data load balancing equipment. What isn’t clear is whether A.W.S. can examine all of its own management practices to see if there are other such decision-making flaws.
Mr. Selipsky, understandably, could not say whether everything else was solid, though he did note that Amazon had eight years of experience running and managing big systems for others.
To be sure, such breakdowns happen all the time with old-style corporate servers. While they don’t have the dramatic effect of an A.W.S. failure, they may collectively represent much more downtime.
In a study commissioned by Amazon, IDC said downtime on A.W.S. was 72 percent less per year than on conventional corporate servers. While the financing of the study makes its conclusions somewhat questionable, its results are in line with similar studies about online e-mail systems compared with in-house products.

Source:
http://bits.blogs.nytimes.com/2013/01/08/amazons-unknown-unknowns/
