HPC YEARBOOK 2012

Workload management HPC 2012

software could deal with cloud computing,’ he says. ‘This is something that companies like us are working on, and we already have a working solution. We can now build a complete cluster within the cloud, or extend an existing cluster into the cloud.

‘For Big Data, we’re finding that our software is being used outside HPC, for example in Hadoop clusters. Cluster management software can help with provisioning, monitoring, configuration management, user management, and so on, all of which you need in Hadoop clusters.’

One issue that faces the industry is the Black Hole Node Syndrome, where one or more compute nodes is unhealthy in a subtle way. The workload manager continues to submit jobs, but these jobs will crash because the node is not healthy. But, as far as the workload manager is concerned, the job is finished, so it sends another one, and another one and so on. These Black Hole Nodes, therefore, can suck all the jobs from a queue very quickly and create a significant productivity issue.

‘This problem can be addressed by a cluster manager and workload manager working together,’ says van Leeuwen. ‘The cluster manager tends to know a lot more about the cluster hardware and software metrics, and can therefore establish whether or not a node is healthy. It then warns the workload manager, and lets the administrator know that action needs to be taken.’
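The cooperation van Leeuwen describes can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Node` and `WorkloadManager` classes, the `cluster_manager_sweep` function and the node and job names are all hypothetical, and no real cluster manager or scheduler API is being modelled. The point is simply that a node flagged as unhealthy is taken out of scheduling, and an alert is raised, before it can drain the queue.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True   # verdict of a (hypothetical) cluster-manager health check
    drained: bool = False  # True once the workload manager has been warned


@dataclass
class WorkloadManager:
    nodes: list
    queue: list
    alerts: list = field(default_factory=list)

    def drain(self, node):
        """Stop scheduling on this node and record an alert for the administrator."""
        node.drained = True
        self.alerts.append(f"node {node.name} drained: failed health check")

    def dispatch(self):
        """Hand queued jobs, round-robin, to nodes that have not been drained."""
        placements = []
        usable = [n for n in self.nodes if not n.drained]
        # If no usable node exists, jobs stay queued instead of crashing.
        while self.queue and usable:
            job = self.queue.pop(0)
            node = usable[len(placements) % len(usable)]
            placements.append((job, node.name))
        return placements


def cluster_manager_sweep(manager):
    """Cluster-manager side: warn the workload manager about unhealthy nodes."""
    for node in manager.nodes:
        if not node.healthy and not node.drained:
            manager.drain(node)


# Hypothetical three-node cluster in which node02 is a black hole node.
nodes = [Node("node01"), Node("node02", healthy=False), Node("node03")]
wm = WorkloadManager(nodes=nodes, queue=["job-a", "job-b", "job-c", "job-d"])
cluster_manager_sweep(wm)
placements = wm.dispatch()
```

Without the sweep, node02 would receive its share of jobs and fail each one; with it, the four jobs are shared between node01 and node03 and the administrator has one alert to act on.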

Workload migration

Many HPC administrators are faced with the issue of how to cope with occasional jobs that demand huge amounts of compute power – but only temporarily. One either has to have an HPC set-up that has huge amounts of over-capacity for most of the time, or one that is able to deal with the majority of day-to-day tasks, but cannot handle these occasional large jobs.

Cycle Computing has developed a series of tools that can help HPC administrators cope with these occasional peaks via dynamic provisioning and data scheduling. ‘We start with the application, rather than the data,’ says Jason Stowe, Cycle’s CEO. ‘Different applications treat data in different ways, such as the way they reference data, where the data is stored and so on. Once we understand that, we then have a better idea of how to spread the load.’

Through its utility supercomputing products, Cycle enables users to create on-demand compute environments via a combination of available nodes within a local set-up and those available via the cloud. So, Cycle works alongside providers of cluster management and workload management software, which tend to deal only with local provisioning.

‘We’re “Switzerland” when it comes to workload and cluster software,’ says Stowe, commenting on the company’s neutrality: ‘We can work with open source or commercial packages, though we find most of our customers tend to use open source tools. We work with those in life sciences, financial services, insurance, manufacturing and visual effects and rendering.’

CycleServer has a ‘Submit Once’ feature, which uses metadata about the application to know which pieces of data need to be replicated where in order to complete the job efficiently. ‘Our software doesn’t distribute tasks into a scheduling environment,’ says Stowe. ‘It uses a scheduler to execute workloads, but it doesn’t implement the scheduler. CycleServer and Submit Once focus instead on the application stacks, and if the internal nodes are full, it will determine the availability of external nodes, for example in the cloud. Our CycleCloud product then helps with provisioning – that is, in finding available nodes, for example, in the public cloud.’

Using this technique, Cycle worked with a major drug design company to develop a 50,000-core HPC cluster. ‘The customer was using a more complex algorithm than it had ever done before,’ says Stowe, ‘and rather than looking at two to three million compounds, they were now looking at 21 million. Their workload was basically two orders of magnitude greater than their usual one. If they had run it on their internal environment, it would have taken months. We created a 50,000-core cluster that ran across multiple data centres in multiple regions, and in public cloud facilities. All of these were computing the same workload at the same time, with our software dynamically placing various aspects of the workload inside of that environment. We call this workload migration.

‘We’re trying to educate people that they need no longer think in a constrained way about what they might be able to run in a reasonable amount of time on what local resources they already have. The moment you place those constraints on a project, you’re compromising the validity of the research you’re doing.’

Additional reporting by Beth Harlen.

Further information

Altair www.altair.com

Adaptive Computing www.adaptivecomputing.com

Bright Computing www.brightcomputing.com

Cycle Computing www.cyclecomputing.com

Platform Computing www.ibm.com
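The overflow behaviour Stowe describes, where work goes to external nodes once the internal ones are full, can be sketched as a simple placement loop. This is a rough illustration only and not Cycle's implementation: the `Pool` class, the `place_jobs` function, the pool names and the core counts are all hypothetical, and the sketch assumes each job must fit whole inside a single pool.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    free_cores: int


def place_jobs(jobs, internal, external):
    """Place each job on the internal pool if it fits, else on the first
    external pool with enough free cores.

    jobs is a list of (job_name, cores_needed) pairs. Returns a dict mapping
    job name to pool name, plus a list of jobs that fit nowhere.
    """
    placements = {}
    unplaced = []
    for job_name, cores in jobs:
        # Internal capacity is always tried first; external pools are overflow.
        for pool in [internal, *external]:
            if pool.free_cores >= cores:
                pool.free_cores -= cores
                placements[job_name] = pool.name
                break
        else:
            # No pool had room, so the job would wait for capacity.
            unplaced.append(job_name)
    return placements, unplaced


# Hypothetical capacities: a small local cluster and two cloud regions.
internal = Pool("local-cluster", free_cores=64)
external = [Pool("cloud-region-a", free_cores=128),
            Pool("cloud-region-b", free_cores=128)]
jobs = [("screen-1", 48), ("screen-2", 32), ("screen-3", 100),
        ("screen-4", 120), ("screen-5", 200)]

placements, unplaced = place_jobs(jobs, internal, external)
```

In this run the first job stays local, the next two spill into the two cloud regions, and the last two remain unplaced until more capacity is provisioned. A real system would also have to weigh where the data lives, which is the part the article attributes to application metadata.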

Viviamo/Shutterstock
