Workload management HPC 2012




software could deal with cloud computing,’ he says. ‘This is something that companies like us are working on, and we already have a working solution. We can now build a complete cluster within the cloud, or extend an existing cluster into the cloud.

‘For Big Data, we’re finding that our software is being used outside HPC, for example in Hadoop clusters. Cluster management software can help with provisioning, monitoring, configuration management, user management and so on, all of which you need in Hadoop clusters.’

One issue facing the industry is Black Hole Node Syndrome, where one or more compute nodes are unhealthy in a subtle way. The workload manager continues to send jobs to such a node, but these jobs crash because the node is not healthy. As far as the workload manager is concerned, however, each job has finished, so it sends another, and another, and so on. These black hole nodes can therefore suck all the jobs from a queue very quickly and create a significant productivity problem.

‘This problem can be addressed by a cluster manager and workload manager working together,’ says van Leeuwen. ‘The cluster manager tends to know a lot more about the cluster hardware and software metrics, and can therefore establish whether or not a node is healthy. It then warns the workload manager, and lets the administrator know that action needs to be taken.’
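The cooperation van Leeuwen describes can be pictured as a periodic health sweep, run on the cluster-manager side, that takes suspect nodes out of the workload manager's pool and raises an alert. The sketch below is a minimal illustration under stated assumptions, not Bright's implementation: the class and function names, the metrics (free scratch space, ECC error count) and the thresholds are all hypothetical. Besides hardware checks, it flags the black-hole signature itself, several consecutive jobs that all "finish" within seconds.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NodeMetrics:
    # Hypothetical per-node metrics a cluster manager might collect.
    name: str
    free_scratch_gb: float
    ecc_errors: int
    recent_job_seconds: List[float] = field(default_factory=list)

def health_problem(m: NodeMetrics, min_scratch_gb: float = 10.0, max_ecc: int = 0,
                   min_job_s: float = 5.0, window: int = 3) -> Optional[str]:
    """Return a description of the problem, or None if the node looks healthy."""
    if m.free_scratch_gb < min_scratch_gb:
        return "low scratch space"
    if m.ecc_errors > max_ecc:
        return "memory ECC errors"
    # Black-hole signature: the last `window` jobs all ended almost immediately.
    recent = m.recent_job_seconds[-window:]
    if len(recent) == window and all(t < min_job_s for t in recent):
        return f"last {window} jobs ended in under {min_job_s}s (possible black hole)"
    return None

class WorkloadManager:
    # Toy stand-in for a workload manager: it only dispatches to schedulable nodes.
    def __init__(self, nodes: List[str]):
        self.schedulable = set(nodes)
        self.alerts: List[str] = []

    def drain(self, node: str, reason: str) -> None:
        # Stop sending jobs to the node and record an alert for the administrator.
        self.schedulable.discard(node)
        self.alerts.append(f"{node}: {reason}")

    def pick_node(self) -> Optional[str]:
        # Toy dispatch policy: lowest-named schedulable node.
        return min(self.schedulable) if self.schedulable else None

def health_sweep(metrics: List[NodeMetrics], wm: WorkloadManager) -> None:
    # Cluster-manager side: check each node and warn the workload manager.
    for m in metrics:
        problem = health_problem(m)
        if problem and m.name in wm.schedulable:
            wm.drain(m.name, problem)

metrics = [
    NodeMetrics("node01", 200.0, 0, [3600.0, 1800.0]),
    NodeMetrics("node02", 200.0, 0, [1.2, 0.8, 1.1]),  # jobs die within seconds
    NodeMetrics("node03", 2.0, 0),                      # scratch disk nearly full
]
wm = WorkloadManager([m.name for m in metrics])
health_sweep(metrics, wm)
print(sorted(wm.schedulable))  # ['node01']
print(wm.alerts)
```

The key design point is that the check runs outside the dispatch loop: the workload manager only sees a shrinking set of schedulable nodes, so it stops feeding the black hole without needing to understand why a node failed. A real deployment would drain nodes through the workload manager's own administrative interface, which this sketch does not model.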


Workload migration

Many HPC administrators are faced with the issue of how to cope with occasional jobs that demand huge amounts of compute power – but only temporarily. One either has to have an HPC set-up with huge amounts of over-capacity for most of the time, or one that can deal with the majority of day-to-day tasks but cannot handle these occasional large jobs. Cycle Computing has developed a series of tools that can help HPC administrators cope with these occasional peaks via dynamic provisioning and data scheduling.

‘We start with the application, rather than the data,’ says Jason Stowe, Cycle’s CEO. ‘Different applications treat data in different ways, such as the way they reference data, where the data is stored and so on. Once we understand that, we then have a better idea of how to spread the load.’

Through its utility supercomputing products, Cycle enables users to create on-demand compute environments from a combination of available nodes within a local set-up and those available via the cloud. Cycle therefore works alongside providers of cluster management and workload management software, which tend to deal only with local provisioning. ‘We’re “Switzerland” when it comes to workload and cluster software,’ says Stowe, commenting on the company’s neutrality: ‘We can work with open source or commercial packages, though we find most of our customers tend to use open source tools. We work with those in life sciences, financial services, insurance, manufacturing, and visual effects and rendering.’

CycleServer has a ‘Submit Once’ feature, which uses metadata about the application to know which pieces of data need to be replicated where in order to complete the job efficiently. ‘Our software doesn’t distribute tasks into a scheduling environment,’ says Stowe. ‘It uses a scheduler to execute workloads, but it doesn’t implement the scheduler. CycleServer and Submit Once focus instead on the application stacks, and if the internal nodes are full, it will determine the availability of external nodes, for example in the cloud. Our CycleCloud product then helps with provisioning – that is, in finding available nodes, for example, in the public cloud.’

Using this technique, Cycle worked with a major drug design company to build a 50,000-core HPC cluster. ‘The customer was using a more complex algorithm than it had ever done before,’ says Stowe, ‘and rather than looking at two to three million compounds, they were now looking at 21 million. Their workload was basically two orders of magnitude greater than their usual one. If they had run it on their internal environment, it would have taken months. We created a 50,000 core cluster that ran across multiple data centres in multiple regions, and in public cloud facilities. All of these were computing the same workload at the same time, with our software dynamically placing various aspects of the workload inside of that environment. We call this workload migration.

‘We’re trying to educate people that they need no longer think in a constrained way about what they might be able to run in a reasonable amount of time on what local resources they already have. The moment you place those constraints on a project, you’re compromising the validity of the research you’re doing.’

Additional reporting by Beth Harlen.

Further information

Altair www.altair.com

Adaptive Computing www.adaptivecomputing.com

Bright Computing www.brightcomputing.com

Cycle Computing www.cyclecomputing.com

Platform Computing www.ibm.com
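Stowe's description of Submit Once and CycleCloud combines two decisions: where to run work (internal nodes first, external cloud nodes only when the internal ones are full) and which data must be replicated to each site, worked out from metadata about what the application reads. The sketch below illustrates both under stated assumptions: a simple first-fit placement policy and tasks that declare their input files. It is a generic cloud-bursting example, not how CycleServer or CycleCloud are implemented, and every name in it (`Pool`, `Task`, `place`, `staging_plan`, the pool and file names) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Pool:
    # Hypothetical compute pool: the local cluster or one cloud region.
    name: str
    free_cores: int
    external: bool
    files: Set[str] = field(default_factory=set)  # data already present at this site

@dataclass
class Task:
    cores: int
    inputs: Set[str]  # application metadata: files this task reads

def place(tasks: List[Task], pools: List[Pool]) -> List[Optional[str]]:
    """First-fit placement that prefers internal pools and only bursts to
    external (cloud) pools when no internal pool has room. Returns one pool
    name per task, or None if nothing fits."""
    ordered = sorted(pools, key=lambda p: p.external)  # internal (False) sorts first
    placement: List[Optional[str]] = []
    for task in tasks:
        target = next((p for p in ordered if p.free_cores >= task.cores), None)
        if target is not None:
            target.free_cores -= task.cores
        placement.append(target.name if target else None)
    return placement

def staging_plan(tasks: List[Task], placement: List[Optional[str]],
                 pools: List[Pool]) -> Dict[str, Set[str]]:
    """Use each task's declared inputs to work out which files must be
    copied to which site before the task can run there."""
    by_name = {p.name: p for p in pools}
    plan: Dict[str, Set[str]] = {}
    for task, site in zip(tasks, placement):
        if site is None:
            continue  # unplaced task: nothing to stage
        missing = task.inputs - by_name[site].files
        if missing:
            plan.setdefault(site, set()).update(missing)
    return plan

pools = [
    Pool("cloud-region-a", free_cores=64, external=True),
    Pool("local", free_cores=32, external=False, files={"compounds.db", "model.bin"}),
]
tasks = [
    Task(16, {"compounds.db"}),
    Task(16, {"compounds.db", "model.bin"}),
    Task(16, {"compounds.db", "model.bin"}),
    Task(64, {"compounds.db"}),
]
placement = place(tasks, pools)
print(placement)  # ['local', 'local', 'cloud-region-a', None]
print(staging_plan(tasks, placement, pools))
```

Here the first two tasks fill the local pool, the third bursts to the cloud region, and only that task's inputs appear in the staging plan, because the local site already holds the data. The fourth task fits nowhere and is left unplaced; a real system would instead provision more external capacity at that point.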



