DCA REVIEW Resilience & Operational Best Practice

are wired to the rack should be part of the data center operations manual, and regular audits of the connections within the racks help to avoid unknowingly relying on single points of failure.

It is not so easy to identify the link between redundancy in the cooling infrastructure and the resilience of the IT to a cooling failure. There is no way to document how cooling moves from the cooling units to each individual server, because the final stage of delivery happens through the invisible medium of air. You may have four redundant cooling units, but how do you know the air will reach where it is needed when you come to rely on them?

The use of air brings with it a further complication: variability. Cooling units in the data center supply air to a common space, and the air paths that form all depend upon one another. Take one or more units offline and the flow from the others rushes to fill the space left behind, changing the whole distribution pattern!

You may still have enough cooling capacity for the space, but that not-so-critical hot spot might have just moved a lot closer to your core network switches. The inherent variability of airflow means that the only way to be certain about your IT resilience today is to deliberately fail your redundant cooling units in the worst-case combination and see what happens.

Doing this in an operational data center is generally not an option. So for the majority of data center operators, who are generally more familiar with power and networking, the answer to the question, “Do you have resilient cooling?” is a re-statement of the system’s redundancy. The true answer is that they do not know.
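To see why a redundancy count alone says so little, consider a back-of-the-envelope check. This is a hypothetical sketch: the unit rating, IT load and N+1 design below are invented for illustration, not taken from any real facility.

```python
from itertools import combinations

# Hypothetical figures: four 100 kW cooling units serving a 250 kW IT load,
# designed as N+1. Enumerate each single-unit failure and check the
# remaining nameplate capacity against the load.
UNIT_CAPACITY_KW = 100.0
NUM_UNITS = 4
IT_LOAD_KW = 250.0

def capacity_after_failures(num_failed):
    """Nameplate cooling capacity left when num_failed units are offline."""
    return (NUM_UNITS - num_failed) * UNIT_CAPACITY_KW

for failed in combinations(range(NUM_UNITS), 1):  # every single-unit failure
    remaining = capacity_after_failures(len(failed))
    print(f"units offline {failed}: {remaining:.0f} kW remaining "
          f"({'OK' if remaining >= IT_LOAD_KW else 'SHORTFALL'})")
```

Every scenario passes this nameplate check, which is exactly the re-statement of redundancy described above. Whether the surviving units' airflow actually reaches every rack is the question such capacity arithmetic cannot answer.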

This is why so many operators remain nervous about their cooling performance and fall back on an expensive, overcautious and potentially ineffective approach: over-engineering a large amount of cooling headroom. Only those who have experienced a cooling failure really know whether their cooling system is resilient. Given the mission-critical nature of nearly all data centers, what can be done?

We can learn from other industries that are faced with the same problem – industries where operators need to know what will happen in exceptional circumstances, but without having to experience it for real. The automotive industry, for example, uses crash test dummies in a real car to find out what will happen to a car’s occupants during an accident without putting any real people at risk. This shows how using a model (instead of a real passenger) is a risk-free way to understand the resilience of a complicated system during a potentially catastrophic event.

18 I Summer 2014

Fig 2. Airflow during failure

For data center cooling, testing can be done quickly and affordably in a computer model. It’s a proven science; virtual testing and prototyping with computer models is used in a vast array of applications where physical testing is not feasible.

Crowd movement at major events is a good example of this. When planning a large event like a street marathon or concert, planners need to ensure the safe flow of people.

Much like airflow in a data center, in an emergency this flow is likely to deviate from what was planned; a fire, for example, may cut off certain routes. Computer simulations are used to make sure there are no pinch points that could cause a crush, and that escape plans are resilient to the most likely crowd responses, all without putting real people at risk.

The one-off, upfront crash testing carried out before a car enters production is sufficient because the car is not expected to undergo any significant changes during its lifetime that would render the results irrelevant. The same could be said of large-scale crowd movements at major events. But a data center is a different prospect altogether: over a long period of time, the configuration of the data center is expected to deviate from its original design.

The churn rate of IT within a modern mission critical data center means any physical “crash testing” of the cooling resilience done at the design stage quickly loses relevance. Daily IT deployment operations will change cooling demands throughout the data center.

Only by testing the cooling system regularly can the IT resilience to failure truly be known. Computer modeling and simulation offer a way to run those tests at any point in the data center’s life, without any risk to the equipment or the applications it supports.

Air movement around an entire data center can be accurately modeled using computational fluid dynamics (CFD). Working in the virtual world means air paths can be traced, allowing the cooling system to be visualized. Worst-case cooling failure scenarios can be analyzed to see the impact on the IT equipment, without putting any of it at risk in real life.
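Production CFD tools solve the full Navier–Stokes equations with turbulence models and real room geometry; none of that fits here. Purely as a toy sketch of the underlying idea, deriving an airflow field from boundary conditions so that air paths can be traced, here is a minimal 2D potential-flow solver. The grid size and the supply and return placements are invented for illustration.

```python
import numpy as np

# Toy potential-flow sketch, not a real CFD model: solve Laplace's equation
# for a velocity potential on a small 2D grid, with a supply along the left
# wall and a return along the right wall (both placements are invented).
NX, NY = 40, 20
phi = np.zeros((NY, NX))
phi[:, 0] = 1.0    # supply: high potential on the left wall
phi[:, -1] = 0.0   # return: low potential on the right wall

# Jacobi relaxation: each interior cell moves toward the mean of its four
# neighbours; the top and bottom walls are treated as no-flow boundaries.
for _ in range(2000):
    phi[1:-1, 1:-1] = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                              phi[1:-1, :-2] + phi[1:-1, 2:])
    phi[0, 1:-1] = phi[1, 1:-1]     # no flow through the top wall
    phi[-1, 1:-1] = phi[-2, 1:-1]   # no flow through the bottom wall

# Air velocity is the negative gradient of the potential.
vy, vx = np.gradient(-phi)
speed = np.hypot(vx, vy)
print(f"peak speed {speed.max():.4f}, mean speed {speed.mean():.4f}")
```

In a real CFD model the same principle scales up: once a velocity field exists in the virtual world, air paths can be traced from any supply to any rack inlet, and a unit can be "failed" simply by changing its boundary condition and re-solving.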

Like a single-line diagram, a CFD model allows operators to see where there are single points of failure. But, even more than that, it also allows them to investigate why those points exist and to test out potential solutions.

Using CFD, cooling resilience moves from an unknown quantity to a metric that can be calculated using physics-based simulation, helping operators make the most of their data center infrastructure and its performance.
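As one illustration of what such a metric could look like, cooling resilience might be summarised as the fraction of failure scenarios in which every rack inlet stays within its temperature limit. The scenario names and rack temperatures below are invented stand-ins for real simulation output; the 27 °C threshold follows the ASHRAE-recommended maximum inlet temperature.

```python
# Hypothetical resilience metric: for each simulated failure scenario, check
# whether every rack's (simulated) inlet temperature stays within the limit.
# All temperatures here are made-up stand-ins for CFD output.
INLET_LIMIT_C = 27.0  # ASHRAE-recommended maximum inlet temperature

# scenario -> simulated inlet temperature per rack (dummy numbers)
scenarios = {
    "all units running": [22.1, 23.0, 22.6, 24.2],
    "unit A failed":     [24.8, 26.5, 25.1, 26.9],
    "unit B failed":     [23.5, 28.4, 24.0, 25.7],
}

def resilience(scenarios, limit):
    """Fraction of failure scenarios in which every rack stays within limit."""
    failure_cases = {k: v for k, v in scenarios.items() if "failed" in k}
    ok = [all(t <= limit for t in temps) for temps in failure_cases.values()]
    return sum(ok) / len(ok)

print(f"resilient in {resilience(scenarios, INLET_LIMIT_C):.0%} of failure cases")
```

A single number like this is easy to track over time, so the effect of each IT deployment or cooling change on resilience can be re-checked against fresh simulation results rather than assumed.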
