high-performance computing
The king is dead, long live the king
Robert Roe finds that upgrading legacy HPC systems is a complicated business, and that some obvious solutions may not be the best options
Upgrading legacy HPC systems relies as much on the requirements of the user base as it does on the budget of the institution buying the system. There is a gamut of technology and deployment methods to choose from, and the picture is further complicated by infrastructure such as cooling equipment, storage and networking – all of which must fit into the available space. In most cases, however, it is the requirements of the codes and applications being run on the system that ultimately define the choice of architecture when upgrading a legacy system. In the most extreme cases, these requirements can restrict the available technology, effectively locking an HPC centre into a single technology, or restricting the adoption of new architectures because of the added complexity of code modernisation, or of porting existing codes to new technology platforms.

Barry Bolding, Cray's senior vice president and chief strategy officer, said: 'Cray cares a lot about architecture and, over the years, we have seen a lot of different approaches to building a scalable supercomputer.' Bolding explained that at one end of the spectrum are very tightly integrated systems such as IBM's Blue Gene. Bolding continued: 'They designed each generation to be an improvement over the previous one, but in essence it would be a complete swap to get to the next generation. Obviously, there are advantages and disadvantages to doing this. It means that you can control everything in the
field – you are only going to have one type of system.'

At the other end of that spectrum is the much less tightly controlled cloud market. Although some cloud providers do target HPC specifically, the cloud is not fully deployed in mainstream HPC today. Bolding stressed that the current cloud market is made up of a myriad of different servers, 'so you never really know what type of server you are going to get.' This makes creating 'very large partitions of a particular architecture' difficult, depending on what type of hardware the cloud vendor has sitting on the floor. 'There is very little control over the infrastructure, but there is a lot of choice and flexibility,' Bolding concluded.
Adaptable supercomputing
Cray designs its supercomputers around an adaptable supercomputing framework that can be upgraded over a generation lasting up to a decade – without having to swap out the computing infrastructure entirely. 'What we wanted to do at Cray was to build those very
large, tightly integrated systems capable of scaling to very large complex problems for our customers; but we also want to provide them with choice and upgradability,’ said Bolding. He continued: ‘What we have designed
in our systems is the ability to have a single generation of systems that can last for up to 10 years. Within that generation of systems, as customers get new, more demanding requirements, we can swap out small subsets of the system to bring it to the next generation. We build and design our supercomputers to have a high level of upgradability and flexibility; much higher than, for instance, the IBM Blue Gene series.'
The Jaguar supercomputer housed at Oak Ridge National Laboratory, before being upgraded to Titan in 2012
Keeping the plates spinning
One critical point for organisations that provide continuous services from their high-performance computing systems is that those services have to continue during upgrades. The UK's Met Office, which provides weather forecasting for government, industry, and the general public, is a case in point. Its HPC centre provides time-sensitive simulations in the form of weather reports, but also flood warnings and predictions for potentially catastrophic events such as hurricanes or storm surges. As such, its system absolutely cannot go out of production, and any upgrades must be carried out seamlessly, without disruption to the original system. This very specific requirement is also faced
by the system administrators at Nasa's High-End Computing Capability (HECC) Project. The HECC has to run many time-sensitive simulations alongside its usual geoscience, chemistry, aerospace, and other application areas. For example, if a space launch does not go exactly according to plan, simulations will be needed urgently to assess whether there was any significant damage and how the space probe should be managed for safe re-entry and recovery. Quite apart from the need to ensure that a space probe does not re-enter the atmosphere in an uncontrolled fashion, with the risk of it crashing into a populated area, if it were a manned mission then the safety of the crew would depend on fast and accurate simulations.