workload management

so his company has stepped in to further develop the product and add functions. It allows users to harness the combined computing power of desktops, servers and clouds. The Univa/Rightscale platform, dubbed 'One Click HPC', is targeted at the large base of Grid Engine users, along with scientists and engineers who want to set up on-demand resources but have little IT training; Tyreman says that someone can build a cluster in 10 minutes even if they've never done it before. Univa can then distribute jobs to private and public resources transparently. It's also interesting to get the


viewpoint of those who don't create this middleware, but do sell it with their systems. One example is HPC integrator OCF, where operations director Russell Slack notes that, in the drive to avoid wasted resources, workload managers are becoming more complex and give users far more than just scheduling. In his experience, many people are still using open source middleware and it takes a lot to move them away from it; this type of software is most popular in scientific and university environments. In contrast, he sees that commercial customers tend to stick with commercially supported products.

Similarly, Bright Computing doesn't itself have a workload manager, but the company allows you to install your favourite one as part of its overall cluster management system and then have all the various components tightly integrated under 'a single pane of glass'. Mark Blessing, VP of marketing, adds that this contrasts with the toolkit approach taken by open source and other commercial software, which gets in the way of providing such a level of integration. The toolkit approach also creates risk, because whenever one tool in the kit is revised, it cannot be taken for granted that the other tools will continue to function as before. This synchronisation problem requires a lot of vigilance on the part of the HPC administrator, as well as complex scripting to keep everything working together. Typically, these scripts are not documented, as they are written on the fly, so the organisation is at risk if the administrator leaves or moves on to another role.

The 'black hole node' syndrome
Bright Computing's Blessing also points to a benefit of having a workload manager tightly integrated with other tools to perform health checks: it avoids the dreaded 'black hole node syndrome', which silently and randomly kills productivity in HPC clusters. Although a workload manager reports that all nodes are running, sometimes a job executes but then crashes, leaving few clues for even the best system administrators to fix the problem. Or it might result in a 'cascade crash', where the workload manager continues sending jobs to nodes with problems; in the worst cases, all the compute jobs can be flushed from the queue for no apparent reason. Combining workload managers with health checks, using software such as that from Bright Computing, helps avoid the black hole node syndrome.

'The black hole node syndrome is a

serious issue,’ confirms Dr Don Holmgren, computer services architect at Fermilab. ‘There are many subtle problems from apparently healthy nodes that can create cascading job failures – and the bigger the job, the more nodes, the higher the probability of failure.’ Fermilab runs nearly two million jobs each year on its HPC

clusters, comprising 21,000 cores in aggregate. Not all of them make it through to completion; the black hole node syndrome crashes a substantial number of these jobs. In spite of efforts to contain the problem, Holmgren estimates that 0.5 per cent of jobs fail as a result of unhealthy nodes (about 9,000 jobs per year), and points out that even the largest HPC operations with extremely skilled staff can suffer from this problem. 'It's painful for users to resubmit their affected jobs, especially when the cluster continues to perform correctly for other jobs,' continues Holmgren, 'and it's not always evident that there is a problem at

first. The scheduler continuously assigns new jobs to the unhealthy nodes. We don’t realise there may be a problem until we notice that an extremely high rate of job failure has occurred on part of the cluster.’ In recent surveys conducted by Bright

Computing, more than 64 per cent of respondents report having been impacted by the black hole node syndrome. Many users have worked to prevent job crashes: 23 per cent of respondents reported that they have written scripts to prevent the problem, while 14 per cent have purchased software to address it. Another 27 per cent report that the problem ‘still drives me nuts’. The remaining 36 per cent are either not impacted or do not realise there is a name for what is killing their jobs, whether it be called the ‘black hole syndrome’ or ‘queue busters’ or something else. Examples of node ‘illnesses’ capable of

crashing jobs include:
• A GPU driver that failed to load;
• An unmounted parallel file system;
• A full scratch disk;
• A malfunctioning InfiniBand adapter;
• An irregular system clock;
• SMART errors on the disk drive;
• System services not running; and
• External user authentication not working properly.
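Illnesses like these are detectable with cheap, scripted probes run on each node. As a rough illustration only (this is not Bright's implementation; the paths and the usage threshold below are hypothetical examples), two of the checks above might look like:

```python
# Illustrative node health checks for two of the 'illnesses' listed above.
# The scratch path, mount point and 95% threshold are hypothetical.
import os
import shutil

def scratch_disk_ok(path, max_used_fraction=0.95):
    """Healthy if the scratch disk is below the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total < max_used_fraction

def parallel_fs_mounted(mount_point):
    """Healthy if the (parallel) file system is actually mounted."""
    return os.path.ismount(mount_point)

def failing_checks(scratch_path="/tmp", fs_mount="/"):
    """Return the names of the checks that fail on this node."""
    results = {
        "full_scratch_disk": scratch_disk_ok(scratch_path),
        "unmounted_parallel_fs": parallel_fs_mounted(fs_mount),
    }
    return [name for name, ok in results.items() if not ok]
```

A real deployment would add checks for GPU drivers, InfiniBand state, clock drift and SMART status, typically by shelling out to the relevant vendor tools.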

With Bright Cluster Manager, specific low-impact tests are run just before a job is executed to identify illnesses that can affect jobs. It then instructs the workload manager to hold a job briefly while the nodes reserved for it and other system elements are tested. If any node fails the health check, predefined actions are executed. Bright and the workload manager then dynamically reschedule the job to a set of healthy nodes while alerting the system administrator.
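The hold-check-requeue sequence described above can be sketched in a few lines. This is only a schematic, built on a scheduler interface invented here for illustration; it is not Bright's or any real workload manager's API:

```python
# Hypothetical sketch of the hold/check/requeue flow; the scheduler
# interface, node names and job id are invented for illustration.

def dispatch_with_health_check(scheduler, job, health_check, alert):
    """Hold a job, test its reserved nodes, and requeue around failures."""
    scheduler.hold(job)                        # pause the job briefly
    bad = [n for n in scheduler.reserved_nodes(job) if not health_check(n)]
    if bad:
        for node in bad:
            scheduler.drain(node)              # predefined action: take node offline
        alert(bad)                             # notify the system administrator
        scheduler.reschedule(job)              # move the job to healthy nodes
    scheduler.release(job)                     # allow the job to start
    return bad


class FakeScheduler:
    """Minimal in-memory stand-in for a workload manager (illustration only)."""
    def __init__(self, assignment, healthy_pool):
        self.assignment = assignment           # job id -> list of node names
        self.healthy_pool = healthy_pool       # nodes known to be healthy
        self.offline = set()
    def hold(self, job): pass
    def release(self, job): pass
    def reserved_nodes(self, job): return list(self.assignment[job])
    def drain(self, node): self.offline.add(node)
    def reschedule(self, job):
        self.assignment[job] = [n for n in self.healthy_pool
                                if n not in self.offline]


sched = FakeScheduler({"job42": ["node1", "node2"]},
                      healthy_pool=["node3", "node4"])
is_healthy = lambda node: node != "node2"      # pretend node2 has a full scratch disk
alerts = []
failed = dispatch_with_health_check(sched, "job42", is_healthy, alerts.append)
print(failed, sched.assignment["job42"])       # ['node2'] ['node3', 'node4']
```

The key design point is that the check runs at dispatch time, against the exact nodes reserved for the job, rather than on a periodic sweep that can miss a node going bad between scans.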
