HIGH PERFORMANCE COMPUTING
HPC maintenance
ROBERT ROE EXPLORES THE ROLE OF MAINTENANCE IN ENSURING HPC SYSTEMS RUN AT OPTIMAL PERFORMANCE
Employing the correct maintenance procedures can reduce the downtime of HPC
systems and help to predict the failure of systems or components. Increasing the time that a computing service is available also increases the amount of scientific output that an HPC system can generate. Russel Slack, operations director at
OCF, explains: ‘From my perspective and the perspective of OCF, maintenance is about ensuring service availability, so it is preventative and proactive.’ ‘Most of the customer feedback, when there are problems, is based on the service being down. System admins realise that it is not their shiny toy in the server room for them to log in to and administer – they are acutely aware that users are the important people here, so you have to focus on service availability for the user community.’ Ensuring efficient use of a
supercomputer is necessary not only to support the user community but also to generate sufficient return on investment for the organisation or enterprise that funds the HPC system. If a system is taken down unexpectedly
there will be a backlog of jobs – but also additional costs in running the datacentre, and getting the service fixed in a timely manner. These costs will continue to mount up with no return on investment until the service can be resumed. ‘If you are running a system for years and then it falls over because you have not patched, then lots of questions will be asked of the system administration team,’ said Slack.
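The cost of an unexpected outage can be put into rough numbers. The short Python sketch below is an illustration only: the function names, the core count, the outage length and the utilisation figure are all hypothetical and do not come from OCF or any real system. It shows how availability over a year and the compute time lost during an outage might be estimated.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the period during which the service was available."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("period must be longer than zero hours")
    return uptime_hours / total


def lost_core_hours(cores: int, downtime_hours: float, utilisation: float = 1.0) -> float:
    """Core-hours of work that could not run while the service was down."""
    return cores * downtime_hours * utilisation


# Hypothetical figures, chosen only for illustration.
HOURS_PER_YEAR = 365 * 24   # length of the period being assessed
outage_hours = 48           # one unplanned two-day outage

print(f"availability: {availability(HOURS_PER_YEAR - outage_hours, outage_hours):.4f}")
print(f"lost core-hours: {lost_core_hours(10_000, outage_hours, utilisation=0.8):.0f}")
```

Even a single two-day outage on a busy system of this assumed size would leave hundreds of thousands of core-hours to be rescheduled, which is the backlog Slack describes.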
There is no one size fits all
OCF offers different options for maintenance contracts, which are built into a service level agreement (SLA) prior to the installation of the system and tailored to a customer’s needs. ‘We will speak to the customer about their technical capability within the team and their requirements for uptime. Is it a research tool or a production tool?’ ‘This allows us to get a feel for their ability to manage and maintain the system properly, and also the general expectation of the user community,’ commented Slack.
Scientific Computing World, June/July 2018
Image: Riello UPS system
The SLA is then based on several
different options for maintenance and support. Frontline support focuses on the replacement of failed hardware – common items that might fail over time such as memory DIMMs or fans and other small components. ‘We will then put together an element of
software support that is break-fix based on the software stack that we deliver on the service,’ said Slack. ‘If a certain piece of software has broken, for whatever reason, then we will dial in and work on that to resolve it for them – and then we tend to add some other options, based on their requirements.’ ‘These requirements might include
remote monitoring at a frequency that has been decided between the client and us – it could be daily, weekly or once a month.
We will perform a sweep of the service, looking at the logs and the hardware to check that things are looking healthy and working optimally,’ said Slack. This information is then gathered and
fed back to the admin team in the form of reports. Reports are provided by OCF at specific, predetermined intervals to state that the system has a clean bill of health – which provides some peace of mind that the service will run as expected. This is packaged into suitable time and maintenance windows, based on the needs of the customer. ‘We offer service credits that can be used in service windows, and this is where the service may be taken offline for a period to do some preventative maintenance,’ explained Slack. ‘It could be that Red Hat has released a kernel update that has a security fix in
@scwmagazine |
www.scientific-computing.com