SCW_JUNJUL13

HPC cooling Advancing computing

François Robin, Jean-Philippe Nominé and Torsten Wilde share PRACE’s perspective on HPC cooling

In 10 years, the peak performance of high-end supercomputers has increased by roughly three orders of magnitude: 10 years ago we were in the teraflop era, while now we are in
the petaflop era. The energy efficiency of supercomputers has also improved greatly in terms of flops per watt, but not at the same pace. Typically, a high-end supercomputer uses between five and 10 times more energy than a decade ago.

Such an increase in power consumption,
combined with a higher cost of energy in most countries and a greater awareness of the environmental impact, has led HPC centres to put a high priority on power efficiency. This can be addressed by implementing more energy-efficient IT equipment and infrastructure.

Regarding the HPC centre infrastructure, the
largest energy usage has usually been by the cooling system (the second largest is usually electrical losses in uninterruptible power supplies (UPS) and transformers). This was the first motivation to change the way supercomputers were cooled, in order to reduce the total cost of ownership (TCO). The second motivation was the increase in power consumption per rack: 10 years ago, 10 kW per rack was a typical value, while nowadays 40 kW per rack is usual. In the future it will probably rise to 100 kW per rack, and possibly higher.

In terms of cooling methods, 10 years ago
most supercomputers were air-cooled, with free flow of air in the computer rooms. To improve the flow of air, racks were organised to alternate cold and hot aisles. Nowadays, air-cooling is still used, but either the hot or the cold aisles are enclosed in order to avoid recirculation of air. Liquid-cooling is used more often, either at the rack level (rear-door heat exchangers) or at the component level (cold plates on the most power-hungry components). In addition, two important trends are gaining momentum: ‘free’ cooling (no chillers) and heat reuse.

There are several challenges: energy
efficiency (reduction of TCO), environmental considerations (Green IT) and power density. Another important point for HPC facilities is flexibility: the lifetime of an HPC facility is typically 20 to 30 years, so a facility designed or refurbished today must take into account, as far as possible, the expected
requirements of future systems (Exascale and beyond). In addition, the temperature of components is an important point, as operating components at higher temperatures may increase the number of failures and increase power consumption at the IT equipment level.

In this context, many discussions on cooling
focus on the components – as mentioned before, air-cooling is still in use, while liquid-cooling is becoming more popular.

For air-cooling, the best ways to organise
airflow are still a subject of discussion (hot or cold aisle enclosure). Computer simulations are often used; measurements are needed to confirm the results of simulation, and often show that the simulations are not as accurate as expected.

For liquid-cooling, rear-door heat exchange is
a mature technology that works well for racks up to 40 kW. One benefit is that it enables ‘room-neutral racks’, meaning there is no requirement for computer room air-conditioning. The main limitation of this technology is that it requires inlet water at relatively low temperature – which
is, in most cases, incompatible with free-cooling and efficient heat reuse. Direct liquid-cooling of components doesn’t have such drawbacks since it uses liquid at a higher temperature, without impacting the operating temperature of components (which, in some cases, is lower than with other cooling technologies).

Implementing a liquid-cooling system
involves a lot of plumbing and a close coupling between the facility and the IT equipment, which means a lot of discussion between IT and infrastructure teams, and the implementation of a global system for monitoring and optimisation. In some cases, it may be necessary to combine both cooling technologies: ‘warm’ water for direct liquid-cooling of high heat-production components (like CPUs, memory and accelerators); ‘cold’ water for rear-door heat exchangers to cool other components (like network switches, disks,
etc.) that are still air-cooled because of their lower power density.

In terms of free-cooling and heat reuse,
as mentioned before, chillers are needed for producing cold (chilled) water. Using warm water-cooling, or even air-cooling when properly designed, makes it possible to use free-cooling – which means no need for chillers. In the warm water case, the facility water loop is connected directly to heat exchangers in which the water is cooled by the outside air and possibly by other means (lakes or rivers, for example). In the air-cooling case, outside air is, after filtering, pushed directly into the computer room by fans.

For most HPC facilities, free-cooling is
possible most of the year. In some cases, chillers are kept for the few days or weeks when free-cooling is not possible. A preliminary study of these conditions is needed when considering free-cooling. Regarding heat reuse, warm water-cooling makes heat reuse (for example for heating offices) much easier due to the higher water temperature. In the case of air-cooling this can also be achieved by pushing the air heated by the IT equipment into offices, but it is usually less efficient than the use of hot water.

Industry is aware of these challenges and
provides, or plans to provide, cooling systems suitable for dealing with them. Collaboration and joint work, including R&D and early testing with big sites, makes it possible to develop solutions suited to the needs of large sites but also usable in small sites.

It is very important to put in place a tool for
analysing, monitoring and recording all the operational data of the facility and of the IT equipment. This makes it possible to control and tune existing optimisation strategies and to find new ones.

It is also worth mentioning the trend towards increasing the temperature in computer rooms. This leads to savings in terms of cooling but should be considered with care, because:
• Increasing the room temperature may increase the failure rate of components; and
• Increasing the operating temperature of components may increase their power consumption (leading to an overall increase in power consumption).
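
As a rough illustration of the trade-off in the two points above, the following Python sketch compares total facility power before and after raising the room temperature. It is only a back-of-the-envelope model: the function name, the choice to express cooling as a fractional overhead on IT power, and every number used are assumptions made for illustration, not measurements from any site.

```python
def net_power_change_kw(it_power_kw, cooling_overhead_before,
                        cooling_overhead_after, it_power_increase_fraction):
    """Change in total facility power (kW) after raising the room temperature.

    All parameters are hypothetical inputs for this sketch:
    - it_power_kw: IT equipment power before the change
    - cooling_overhead_before / cooling_overhead_after: cooling power as a
      fraction of IT power (for example 0.30 means 30% on top of IT power)
    - it_power_increase_fraction: relative increase in IT power caused by
      running components warmer

    A negative result is a net saving, a positive result a net increase.
    """
    total_before = it_power_kw * (1 + cooling_overhead_before)
    it_power_after = it_power_kw * (1 + it_power_increase_fraction)
    total_after = it_power_after * (1 + cooling_overhead_after)
    return total_after - total_before


# Illustrative values only: a 1,000 kW system whose cooling overhead
# drops from 30% to 20% when the room is run warmer.
small_penalty = net_power_change_kw(1000, 0.30, 0.20, 0.02)
large_penalty = net_power_change_kw(1000, 0.30, 0.20, 0.10)

print(f"IT power up 2%:  {small_penalty:+.0f} kW")   # negative: net saving
print(f"IT power up 10%: {large_penalty:+.0f} kW")   # positive: net increase
```

With these assumed figures, the cooling saving outweighs a 2% rise in IT power but not a 10% rise, which is why recorded operations data is needed before changing the set point. The sketch does not model the effect on component failure rates.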

www.prace-project.eu/

@scwmagazine | www.scientific-computing.com
