tape – then users have to wait a little while for the tape to be mounted and the file to be copied back to disk, after which they are reconnected to the disk copy in a way that is transparent to them.
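That recall workflow – serve the file from disk if it is already there, otherwise stage it in from tape first and then hand the user the disk copy – is the classic hierarchical-storage pattern. A minimal sketch of the logic is below; the paths and helper names are assumptions for illustration, not Castor's actual interfaces.

```python
import shutil
from pathlib import Path

# Illustrative only: these locations stand in for the disk pool and the
# tape library; they are not Castor's real layout or API.
DISK_POOL = Path("/pool/disk")        # fast, user-facing disk cache
TAPE_ARCHIVE = Path("/archive/tape")  # stands in for the tape robot

def open_archived_file(name: str):
    """Serve a file from the disk pool, staging it from 'tape' if needed."""
    disk_copy = DISK_POOL / name
    if not disk_copy.exists():
        # The user waits here while the archived copy is staged back to
        # disk; in the real system this is where the tape mount happens.
        DISK_POOL.mkdir(parents=True, exist_ok=True)
        shutil.copy(TAPE_ARCHIVE / name, disk_copy)
    # Either way, the handle the user gets back points at the disk copy.
    return disk_copy.open("rb")
```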
Castor was started in 1999/2000. It has given faithful service for a long time and currently holds an archive of around 100PB of data. But three or four years ago, Cern started a new system, EOS, which is intended to be a disk-only system. The data will not overflow to tape; instead, the system will handle an unprecedented quantity of disk space – around 70PB. This disk-only system has some unique features. The main issue with any large disk-drive
system is that the disks fail. Cern has around 2,000 machines, each with a variable number of disks – typically between 24 and 48 – so at least 40,000 drives, and therefore, statistically, it can expect a few disks to fail every night. Again, the answer is duplicate, geographically separate copies. ‘Every file is put onto two disks, so a single failure will not impact the user because a second copy of their data is somewhere else. And the system detects the imbalance. What is really amazing to me is that we have two computer centres: one is here in Meyrin; the other one, with a more or less equivalent capacity, is located in Hungary, at the Wigner Centre near Budapest,’ Lamanna explained. If one of the two fails, the users are directed to the other until the missing files are recreated automatically by the system. He said: ‘To my knowledge this is absolutely unique. The distance between the two centres is around 1,000km; we have 22ms latency; but this works as a single system.’
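Both points – the expected nightly failures and the two-site replication – can be seen in a small back-of-the-envelope sketch. The ~3 per cent annualised drive-failure rate and the placement and repair helpers below are assumptions made purely for illustration; this is not EOS code.

```python
# Assumed failure rate for illustration: ~3% of drives per year.
# 40,000 drives then give about 40_000 * 0.03 / 365 ≈ 3 failures per day,
# consistent with 'a few disks every night'.
DRIVES, ANNUAL_FAILURE_RATE = 40_000, 0.03
print(f"expected failures per day ≈ {DRIVES * ANNUAL_FAILURE_RATE / 365:.1f}")

# Two-site, two-replica placement in miniature (site names from the article).
SITES: dict[str, set[str]] = {"meyrin": set(), "wigner": set()}

def write_file(name: str) -> None:
    """Put one replica at each site, so a single failure loses nothing."""
    for files in SITES.values():
        files.add(name)

def read_file(name: str, preferred: str = "meyrin") -> str:
    """Read from the preferred site; fall back to the other if it fails."""
    for site in [preferred] + [s for s in SITES if s != preferred]:
        if name in SITES[site]:  # stands in for 'site is up and holds the file'
            return f"{name} served from {site}"
    raise FileNotFoundError(name)

def repair(failed_site: str) -> None:
    """Re-create whatever a failed site lost from the surviving replicas."""
    surviving = next(s for s in SITES if s != failed_site)
    SITES[failed_site] |= SITES[surviving]
```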
Simplifying storage
Jim Hughes, chief technology officer at Seagate, believes that there are ways of simplifying the system even further, and Seagate’s partnership with the Cern Open Lab project is intended to investigate how the company’s new Kinetic Open Storage system can be adapted to help with Cern’s data storage needs. ‘About two and a half years ago, Seagate
embarked on a path of what can be done to make disk drives simpler and faster and easier and cheaper,’ he said. ‘One of the things we realised was that the interface that people use to talk to disk drives is the same interface that’s been used for the past 40 to 50 years – and it is trying to mimic a system that doesn’t exist anymore. Nowadays storage systems, even inside a disk drive, are in essence virtualised, so the question is: can we make things easier for us – for Seagate to build and therefore cheaper – and easier for customers to use, by slicing away layer after layer of software so as to make everybody’s solution simpler?’
[Image: The first ATLAS Inner Detector End-Cap]
To build a scale-out storage system like
Cern’s EOS, he continued, ‘you always have an architecture which is a server on the top, some distribution system, then a second server that holds the disk drives. Why can’t the machine that is making the EOS request – instead of making the request to another machine which then makes the request to the disk drives – why can’t the machine that is making the request talk directly to the disk drive? What we are
attacking with Kinetic is elimination of the servers from the disk drives. Not cost reduction but elimination.’ By giving the disk drives an API, the
applications can talk directly to the storage over Ethernet. ‘Now we can have a very inexpensive switch in front of the disk drives, so that the disk drives are just on the network – peers to the compute nodes that are doing the work. When you do a project in Open Lab you have to have some pretty lofty goals, and that is the goal: to eliminate the servers in front of the disk drives,’ he said. The value of Kinetic is cost savings, he continued, but for him personally, the goal was to create a better way of doing things.
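The shape of that idea – the application opening a connection straight to a drive on the network and issuing key-value operations, with no storage server in the path – can be sketched as follows. This is not Seagate’s actual Kinetic protocol; the toy wire format, address, and key scheme are invented purely to show where the server layer disappears.

```python
import socket

# Toy request format for illustration only – NOT the real Kinetic API.
# The architectural point is that the client connects to the drive's own
# IP address and port; there is no file server or block layer in between.

def put(drive_addr: tuple[str, int], key: bytes, value: bytes) -> None:
    """Store a key-value pair directly on a network-attached drive."""
    with socket.create_connection(drive_addr) as sock:
        sock.sendall(b"PUT %d %d\n" % (len(key), len(value)) + key + value)

def get(drive_addr: tuple[str, int], key: bytes) -> bytes:
    """Ask the drive itself for the value stored under `key`."""
    with socket.create_connection(drive_addr) as sock:
        sock.sendall(b"GET %d\n" % len(key) + key)
        sock.shutdown(socket.SHUT_WR)  # tell the drive the request is complete
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
        return b"".join(chunks)

# Hypothetical usage; the address and key naming are made up for the example:
#   put(("192.0.2.17", 8123), b"file42:chunk0", b"...payload...")
```

An object store built this way still needs metadata and placement logic somewhere, but the data path itself goes from application to drive in a single hop – which is exactly the server elimination Hughes describes.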
Even Tier 2 centres – the ones that are tasked with analysing rather than storing and distributing the data – can have storage problems. Michelle Butler, in charge of storage technologies at the US National Center for Supercomputing Applications, explained
that the NCSA is a Tier 2 centre. However, the LHC work does not use the main Cray Blue Waters supercomputer. The University of Illinois has its own
cluster infrastructure at the NCSA, on which individual departments can buy their own compute nodes, again using DDN disks. ‘That is the cluster that Atlas uses. One of their applications actually put so much stress on the data disk structure for the general campus cluster that we built our own storage infrastructure for them so they did not impact the campus cluster.’ She said: ‘They had their own servers. They came in and said “we’re really going to knock that file system” and we didn’t believe them.’ ‘There is a very large gap that I see between
the HPC environment and the cluster environment. In future, they are going to marry a lot more,’ she continued. ‘How do we build storage environments using commodity parts that make them just as reliable and fast as the large supercomputing environments? We’ve spent a lot of time learning more about smaller environments, and how to scale those.’ Her team had been studying how to make disk environments smaller, cheaper, and less power-hungry. She too sees a need for different software to make them more reliable, while at the same time cutting down on hardware. In every aspect of the LHC data storage
problem, the diagnosis is similar. The solutions they advocate may differ, but for Laura Shepard from DDN, Seagate’s Jim Hughes, Cern’s Massimo Lamanna, Triumf’s Reda Tafirout, and NCSA’s Michelle Butler, the outlook for the future is the same: simplify the system; reduce the hardware; make the architecture cleverer; and reduce the costs – and, all the while, increase the capacity.