High-performance computing


Could cloud computing become not just an alternative but a dominant way of executing HPC workloads?


As the main article makes clear, hardware configuration has been a barrier to the wider adoption of HPC in the cloud; the main ‘public cloud’ providers, such as Amazon, Google, and Microsoft, have – understandably – invested in commodity hardware that is adapted to commercial and business computing needs, but is less well configured for many tightly coupled high-performance computing workloads. There are parallels with the way in which Beowulf clusters started to change the face of high-performance computing just over 20 years ago. Clusters used commodity components, lowered the financial cost of owning a supercomputer, and thus widened the pool of users who could run high-performance workloads. But the configuration of the hardware was different from what people were used to, and they had to change the way they wrote their programs.


Just as clusters changed the way supercomputers were programmed, so new developments are adapting HPC to the cloud. Marcos Novaes, cloud platform solutions engineer with Google, has no doubts: ‘I think this is the year in which we will see a radical shift in HPC’.


As a self-confessed ‘HPC old timer’ – he worked on the design of the IBM SP2 computer, which was once the second largest in the world, at the Lawrence Livermore National Lab in 1999 – he is realistic about the limitations of the cloud. Traditionally, HPC uses tightly coupled architectures: ‘Which basically means an InfiniBand interconnect with micro-second latency. It is hard to achieve such latency in a multi-tenant cloud environment without sacrificing the flexibility of resource allocation, which provides the best economy of scale. So, to date, most of the HPC workloads that we see moving to the cloud are embarrassingly parallel.’

‘However, for the message-intensive workloads that are affected by latency, we do have a challenge that will require a different approach,’ Novaes continued. He believes that it can be found in recent work by Professor Jack Dongarra of the University of Tennessee, one of the pioneers of HPC since the Top500 list started using his Linpack benchmark as the common application for evaluating the performance of supercomputers in 1993. Dongarra has developed a new directed acyclic graph (DAG) scheduler called PaRSEC and a new parallel library called DPLASMA to support heterogeneous computing environments; together they also address the need to cope with higher latencies. According to Dongarra, the future of HPC is in algorithms that avoid communication. Novaes points out that the migration of HPC to the cloud means that ‘the move of HPC to a new communication-avoiding platform has already started, and we will see a very strong acceleration, starting this year, as such technologies become available’.

also offer a choice of infrastructure: does this user want to run bare metal; does the other user want to run a virtual machine; and yet another user in a container?’ The latest release of the software, in January, offers a scenario for bursting not just from bare metal, but also from a private to the public cloud. Failure to observe Van Leeuwen’s helpful clarification may contribute to some of the current confusion surrounding the exact role of the cloud in HPC. While Intersect360 reports only a three per cent uptake, IDC, another market research organisation, last year reported a significantly higher take-up of the cloud, at around 25 per cent – a doubling since 2011 – in contrast to Intersect360’s view that demand has, as yet, hardly increased.

Embarrassingly parallel jobs suit the cloud
Although it may not yet have been translated into significant increases in usage, attitudes among end-users are changing, and interest in the cloud is growing. David Power, head of HPC at Bios-IT, said: ‘We are starting to see a bit of a shift in our users’ perceptions. Some of the initial barriers to the cloud have been around the spec of the hardware from the larger cloud providers’. A few years ago, he continued, the cloud did not offer a performance benefit since most of the hardware ‘was a generation or two old, with no fast interconnect and relatively slow storage, and it was all virtualised’.

Both Power and Khosla cited Cycle Computing as one of the pioneers, offering a service using Amazon Web Services to big pharma companies. According to Power, such genomics jobs are embarrassingly parallel, and so do not require heavy MPI communication between threads, cores, and different jobs. ‘That is where people began to realise there is some merit here’. Khosla took the same view: ‘What has worked well in the cloud are massively parallel applications that are not running for a long, long time and do not have sensitivity to storage or other compute nodes. Bio apps and pharma have worked well, mostly in the burst capability.’

In contrast, as Van Leeuwen pointed out, oil and gas companies and those doing seismic processing require huge amounts of data, and it would be far too time-consuming to upload it all to the cloud. Power reiterated the point: ‘We are starting to see the low-hanging fruit of HPC workloads accepted as a decent fit for the cloud. Anything that’s loosely coupled, without too much heavy I/O, without having to move too much data in and out – I think they are decent candidates for cloud workloads. Whereas, if you look at the very high-memory requirement workloads, or highly parallelised jobs that run on 10,000 cores and above, they’re probably not good candidates for cloud workloads today.’

Khosla’s assessment is similar; the cloud is not suitable for those cases where there are affinity-type requirements, and users are running on InfiniBand, requiring very high performance and low latency. ‘That’s been a show-stopper for most people. Other than Azure, which has just announced InfiniBand, that’s just not available.’
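The distinction Power and Khosla draw between loosely coupled and tightly coupled workloads can be made concrete with a small, purely illustrative Python sketch of an embarrassingly parallel job: many independent tasks with no MPI-style communication between them. The score_sample function and its inputs are invented for the example and merely stand in for something like per-sample genomics scoring.

    # Illustrative only: a minimal embarrassingly parallel pattern, in the spirit of
    # the genomics-style workloads described above. Each task is independent, so no
    # worker ever waits on data from another worker.
    from concurrent.futures import ProcessPoolExecutor


    def score_sample(sample_id: int) -> tuple[int, float]:
        """Stand-in for one independent unit of work (e.g. scoring one sample)."""
        # Real work would go here; this toy version just derives a number.
        return sample_id, (sample_id * 37 % 101) / 100.0


    if __name__ == "__main__":
        samples = range(1000)  # hypothetical list of independent inputs
        with ProcessPoolExecutor() as pool:
            # Each sample is processed in isolation; results come back in input order.
            results = list(pool.map(score_sample, samples))
        print(f"processed {len(results)} independent tasks")

Because no task waits on another, a job of this shape can be spread across ordinary cloud instances without the micro-second-latency interconnect that tightly coupled MPI codes depend on.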


Pitfalls in the cloud
There are also synergistic hardware/software issues. Khosla pointed out that, for companies and organisations that have written their own applications in-house, ‘going to the cloud is not very easy. You have a lot of dependencies that the developers write in, that assume a static environment. So to go to the cloud, there has to be an effort to decouple them,’ and the code has to be re-written.

Over the past year, the ability to spin up resources in the cloud has got a lot better than, say, four years ago, Khosla observed, and X-ISS is seeing a number of organisations trying to get round the cost issue by opting for ‘spot pricing’. But this in itself can present challenges, he argued, because ‘the cloud is not always there. Nodes can disappear, because someone using spot pricing took them away. So you have to have applications handle the fact that this can happen more often than in your own environment – when they only go away when you have a hardware issue, which is not that often. So now you have to write your check-pointing, and your
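Khosla’s point about check-pointing for spot capacity can be sketched, under the assumption of a simple loop over independent work items, as an application that periodically records its progress so a replacement node can resume rather than restart. The file name, work function, and checkpoint interval below are all hypothetical.

    # Illustrative only: application-level check-pointing so a job can survive the
    # loss of a spot/preemptible node. Paths and the unit of work are hypothetical.
    import json
    import os

    CHECKPOINT = "checkpoint.json"  # in practice, on shared or object storage


    def load_checkpoint() -> int:
        """Return the index of the next unit of work, or 0 if starting fresh."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["next_index"]
        return 0


    def save_checkpoint(next_index: int) -> None:
        """Write progress atomically so a half-written file is never read back."""
        tmp = CHECKPOINT + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": next_index}, f)
        os.replace(tmp, CHECKPOINT)


    def process(index: int) -> None:
        """Stand-in for one independent unit of work."""
        pass


    if __name__ == "__main__":
        total = 10_000  # hypothetical number of work items
        for i in range(load_checkpoint(), total):
            process(i)
            if i % 100 == 0:  # checkpoint every 100 items; tune to taste
                save_checkpoint(i + 1)
        save_checkpoint(total)

If a spot node is reclaimed mid-run, the restarted job reads the last saved index and carries on from there, repeating at most the items since the previous checkpoint instead of the whole computation.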






