

Issue 19, January 2012


FOCUS UPDATE: UPTIME


Statistically, most data center outages are attributed to human error, according to data collected by the Uptime Institute. Some of this year’s highest-profile outages, however, were caused by the failure of automated failover mechanisms, on either the power infrastructure or the network side.


Regardless of the root cause, data center outages are expensive. The Ponemon Institute calculated that an outage can cost a company about US$1.02m. The figure comes from the institute’s survey of downtime costs at 41 US data centers, the results of which were released in May.


Here are some of the loudest downtime incidents from the last year:


GLOBAL OUTAGES FOR RIM


In October, Research In Motion (RIM), maker of the BlackBerry smartphones, experienced an outage of its infrastructure. While it first affected customers in Europe, the Middle East and Africa, the effects spread to the Americas the following day.


The company traced the issue to the failure of a core switch at one of its data centers. “Although the system is designed to failover to a back-up switch, the failover did not function as previously tested,” RIM representatives said. “As a result, a large backlog of data was generated and we are now working to clear that backlog and restore normal service as quickly as possible.”


Issues persisted over three days, causing intermittent service delays for many customers. To compensate them, RIM offered free downloads of some BlackBerry App World premium applications, for which customers usually have to pay.


NATURE SINKS GOOGLE’S CLOUD


Google’s App Engine Datastore services went down in August. The company traced the cause to a thunderstorm that interrupted utility power to a Google data center in the American Midwest. In this case, the automatic-failover mechanism for switching to generator power failed to do its job.


42 www.datacenterdynamics.com


WHEN THE FAILOVER FAILS


From BlackBerry to Amazon, 2011’s outages show that global infrastructures also mean global-scale failures.


By Yevgeniy Sverdlik


Google’s Ikai Lan wrote in an email to App Engine customers: “Power distribution equipment in the data center failed in the wake of the loss of utility power, which powered off a subset of the machines in the data center.”


The outage caused loss of a portion of the compute and storage capacity supported by the data center, leading to high latency, server errors and even total downtime for App Engine master-slave Datastore applications.


Google did not specify why the data center’s electrical systems failed to switch to generators when it lost utility power, or how long the facility remained without power.


The App Engine team performed an emergency failover at the application level, migrating affected applications to a back-up data center. As a result, some applications appeared to “jump backwards in time” as they came back up. This happens because data written to the primary data center during the period immediately preceding the outage does not get migrated.
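

The “jump backwards in time” effect can be illustrated with a minimal sketch. It assumes a primary data center that acknowledges writes immediately and copies them to the back-up only later, so anything still waiting to be copied is lost on failover. The class, its methods and the "counter" key are hypothetical names chosen for illustration; this is not Google's implementation.

```python
class MasterSlavePair:
    """Toy model of a primary store with an asynchronously updated back-up."""

    def __init__(self):
        self.primary = {}   # data served to applications
        self.backup = {}    # copy held in the back-up data center
        self.pending = []   # writes accepted by the primary, not yet copied

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Copy every pending write to the back-up, then clear the queue.
        for key, value in self.pending:
            self.backup[key] = value
        self.pending.clear()

    def fail_over(self):
        # The primary is gone: serve from the back-up's copy instead.
        # Writes that were still pending never reached the back-up.
        lost = list(self.pending)
        self.pending.clear()
        self.primary = dict(self.backup)
        return lost

    def read(self, key):
        return self.primary[key]


pair = MasterSlavePair()
pair.write("counter", 1)
pair.replicate()             # the back-up now holds counter = 1
pair.write("counter", 2)     # accepted just before the power loss
lost = pair.fail_over()      # emergency switch to the back-up

print(pair.read("counter"))  # 1
print(lost)                  # [('counter', 2)]
```

After the failover the application reads the older value, which is the behavior described above: the most recent writes to the primary were never migrated.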


AMAZON’S DUBLIN FIASCO


A utility supplying power to an Amazon data center in Dublin first blamed stormy weather for a power outage that affected the facility, saying a lightning strike had taken out a 10MW transformer, but later retracted that initial diagnosis. According to Amazon, the facility failed to switch to back-up generators after it lost utility power. The Amazon Web Services (AWS) team said it believed the data center’s programmable logic controllers (PLCs), which synchronize electrical phases between generators, were to blame.


A PLC at the facility detected a ground fault and failed to complete its task, leading to the data center outage because most of the data center’s back-up generators were disabled. The outage affected Amazon’s Infrastructure-as-a-Service businesses, Elastic Compute Cloud (cloud servers) and Elastic Block Store (cloud storage), and its cloud database service called Relational Database Service. Cloud instances of these three services hosted in Dublin felt most of the effect.


Amazon said nearly all EC2 instances and about 60% of EBS volumes in the affected availability zone went down. Networking gear connecting the zone to the Internet and to other availability zones in the region also went down, causing connectivity issues that resulted in customers receiving API errors. The AWS team said it would make a number of changes to the data center to prevent such issues from recurring. The changes included adding redundancy and more isolation for the PLCs to insulate them from failures.


TELECITY SUFFERS IN DOCKLANDS


A power outage at Telecity’s Meridian Gate data center in London’s Docklands in July caused disruptions to companies colocating there and to their customers. The provider traced the outage to a “fault on a breaker in the power distribution system”, according to a note it sent to one of the affected customers, network connectivity provider C4L. The network provider said all its customers were kicked offline and its 10G ring was broken by the outage. Power to the facility was restored within 20 minutes.


In an emailed statement, Telecity said: “We have resolved a power outage that affected our Meridian Gate data center earlier today. Our engineers responded quickly and restored power to the facility in around 20 minutes. We kept all affected customers informed throughout the process and have apologized for the disruption.”


In addition to C4L, affected customers included online marketing service Initial Rewards, email marketing company Easy Inbox, and online multimedia agency BlueLevel, among many others.

