HPC 2013-14 | Big data


‘…of compute, and then running into the I/O challenges of how those nodes talk to storage, has been one of the classic long-term trends within HPC,’ remarked Geoff Noer, senior director of product marketing at Panasas. ‘Most of the traditional HPC markets have found the solution – parallel file systems. But the challenges of parallelism and getting the performance from the data are issues that go beyond analytics or HPC; they’re an enterprise-wide problem.’ The classic difficulty here comes down to the move from a compute infrastructure that is monolithic in nature to one that is highly parallel, with many separate compute engines working on the problem; to have any chance of being successful, organisations have to adopt an equivalent storage architecture.

‘If you have hundreds or thousands of compute nodes, it’s a pretty unrealistic expectation that any one storage box with one file head will be able to handle all that concurrent access, so it really is about adopting parallel storage on the backend in order to be able to handle the parallel access on the front end,’ said Noer, adding that this doesn’t mean one-to-one mapping is necessary, but rather that the amount of storage performance needs to scale with compute performance.
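
To make that access pattern concrete, the sketch below uses MPI-IO (via the mpi4py and numpy packages, assumed to be available) to have every compute rank write its own slice of a single shared file. It is a minimal illustration of the concurrent, many-client access Noer describes, not an example of any particular vendor’s storage; the file name and block size are arbitrary.

    # Minimal sketch: many ranks writing one shared file concurrently.
    # Assumes an MPI installation plus the mpi4py and numpy packages.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank prepares 1 MiB of data (illustrative size only).
    block = np.full(1024 * 1024, rank % 256, dtype=np.uint8)

    # All ranks open the same file and write at non-overlapping offsets.
    fh = MPI.File.Open(comm, 'shared_output.dat',
                       MPI.MODE_WRONLY | MPI.MODE_CREATE)
    fh.Write_at_all(rank * block.nbytes, block)  # collective write
    fh.Close()

Run under, say, ‘mpiexec -n 512 python write_shared.py’, this generates exactly the kind of concurrent load that a single-headed NAS box struggles with and that a parallel file system is built to absorb.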


Panasas believes that a scale-out NAS architecture, such as its ActiveStor 14, will bridge that gap. ActiveStor takes a bladed approach that can handle a large number of concurrent access requests but is as simple to use as a single monolithic NAS system. Interestingly, the system uses Flash technology instead of hard drives for some of the file-system metadata; Noer commented that this has brought tremendous improvements to the performance of small-file access and directory look-ups, making the file system far more responsive.


‘The very high-capacity component that comes with the large growth in unstructured data has been well fulfilled through the use of hard-drive technologies, but what’s becoming difficult is trying to keep up with the amount of throughput per terabyte,’ said Noer. He added that small-file and random operations have been much faster on Flash for many years, but that emerging technologies like PCI Express will allow individual devices to go into the Gb/s range. This will make a big difference to the throughput that storage systems are able to handle. ‘Improvements are being made in the compute infrastructure, but we are a little overdue for a significant increase in throughput,’ Noer remarked. ‘Today that’s solved by putting increasingly large numbers of hard drives at the problem, but the application of Flash will enable people to hit significantly higher amounts of throughput performance in much smaller systems. This will unclog performance limitations across the board.’
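
Noer’s ‘throughput per terabyte’ point is easy to quantify with rough, illustrative numbers (the figures below are assumptions for the sake of the arithmetic, not Panasas specifications): a 4TB near-line drive streaming at around 150MB/s offers roughly 0.04GB/s per terabyte, whereas a 1TB PCIe Flash card delivering around 1GB/s offers about 1GB/s per terabyte, more than an order of magnitude more.

    # Back-of-the-envelope throughput-per-terabyte comparison.
    # All capacities and speeds are illustrative assumptions, not vendor figures.
    devices = {
        'nearline_hdd': {'capacity_tb': 4.0, 'throughput_gb_per_s': 0.15},  # ~150 MB/s
        'pcie_flash':   {'capacity_tb': 1.0, 'throughput_gb_per_s': 1.0},   # ~1 GB/s
    }

    for name, d in devices.items():
        per_tb = d['throughput_gb_per_s'] / d['capacity_tb']
        print(f"{name}: {per_tb:.3f} GB/s per TB of capacity")

    target = 40.0  # aggregate GB/s an assumed cluster needs
    for name, d in devices.items():
        print(f"{name}: ~{target / d['throughput_gb_per_s']:.0f} devices for {target} GB/s")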


The role of SSDs

Another storage trend, observed by Vipin Chaudhary, president of Scalable Informatics, is the rapid growth in the use of solid-state drives (SSDs) to increase IOPS (input/output operations per second), to reduce latency, or to reduce power consumption.


The lowering of disk prices, along with their increase in size and ease of use, is pushing the replacement of tapes in many markets, he said, and as such SSDs are replacing spinning disks, which Chaudhary predicts will eventually be used only for archival or cold storage. Scalable Informatics’ systems are able to utilise both SSDs and spinning disks effectively, and use combinations of them to balance the performance requirements and cost of the solution. The company’s software is also tuned to the differing characteristics of the two technologies.
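
A very simplified sketch of the kind of placement decision such mixed SSD/disk systems make is shown below; the thresholds and function name are hypothetical and are not taken from Scalable Informatics’ software.

    # Hypothetical placement policy for a mixed SSD/HDD system (illustrative only).
    def choose_tier(size_bytes, reads_per_day):
        """Small or frequently read data goes to SSD; bulk, cold data to HDD."""
        SMALL_FILE = 1 * 1024 * 1024        # 1 MiB, assumed threshold
        HOT_ACCESS = 100                    # reads per day, assumed threshold
        if size_bytes < SMALL_FILE or reads_per_day > HOT_ACCESS:
            return 'ssd'                    # low latency, high IOPS
        return 'hdd'                        # cheap capacity for cold or archival data

    print(choose_tier(64 * 1024, reads_per_day=500))    # -> 'ssd'
    print(choose_tier(50 * 1024**3, reads_per_day=1))   # -> 'hdd'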


Conversations surrounding big data often focus on improving density and the ability to store increasingly large amounts of data in a more compact form. Technologies such as Flash and SSD are very important developments that improve data access, but Peter Piela, VP of engineering at Terascala, warns that the role of the application is typically overlooked. ‘We need to concentrate on making applications as efficient as possible. If the storage is able to deliver great theoretical performance but is unable to match that promise in real-world situations, then all this great technology is for naught,’ he commented. ‘Storage needs to be far more intelligent so that it is aware of the applications accessing it. Instead of people manually dealing with configuration and optimisation, storage should be able to adapt to deliver optimal performance on the workloads being submitted, or offer guidance to users on how to structure their workloads to best utilise the underlying storage.’ Essentially, storage needs to understand the complexity of HPC workflows.


Because the HPC process comprises a multi-step workflow that potentially involves different clusters and data devices, users are faced with a production line that requires them to think about how data and applications meet each other. ‘In order to do that, there’s an underlying notion of having to move the data to ensure that it’s in the right place at the right time, whether that’s at the point in the workflow when users are actively running compute, or at the point of archiving for long-term data storage,’ said Piela. ‘Keeping track of all this data is a big challenge, but progress is being made in this area.’
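
As a rough sketch of the ‘right place at the right time’ idea, the following stages a dataset onto fast scratch storage before a compute step and pushes the result back to an archive tier afterwards. The paths and the run_compute_step call are placeholders, and this is an illustration of the workflow pattern, not a description of Terascala’s product.

    # Hypothetical stage-in / compute / stage-out workflow (paths are placeholders).
    import shutil
    import subprocess
    from pathlib import Path

    ARCHIVE = Path('/archive/project/input.dat')    # slow, cheap tier
    SCRATCH = Path('/scratch/project/input.dat')    # fast parallel file system

    # Stage in: make sure the data is on fast storage before compute starts.
    SCRATCH.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ARCHIVE, SCRATCH)

    # Compute step (placeholder for a real solver or analysis job).
    subprocess.run(['./run_compute_step', str(SCRATCH)], check=True)

    # Stage out: push results back to long-term storage and free the scratch space.
    shutil.copy2(SCRATCH.with_suffix('.out'), ARCHIVE.with_suffix('.out'))
    SCRATCH.unlink()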


IBM’s Gord Sissons added that technologies such as GPFS, IBM’s General Parallel File System, are being used to address some of the shortcomings of the Hadoop file system, giving customers a platform that can store data more reliably, automate the process of replicating data across physical locations, and automatically migrate infrequently used data to less costly storage platforms where it can persist for longer periods.
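
The ‘migrate infrequently used data’ behaviour can be pictured with the small sketch below, which walks a fast storage pool and moves anything untouched for 90 days to a cheaper tier. It is a generic illustration in Python, not GPFS’s actual policy engine or syntax, and the pool paths and idle threshold are assumptions.

    # Generic cold-data migration sketch (not GPFS policy syntax; paths assumed).
    import shutil
    import time
    from pathlib import Path

    FAST_POOL = Path('/fast_pool')       # expensive, high-performance storage
    COLD_POOL = Path('/cold_pool')       # cheaper, capacity-oriented storage
    MAX_IDLE = 90 * 24 * 3600            # migrate files idle for more than 90 days

    now = time.time()
    for path in FAST_POOL.rglob('*'):
        if path.is_file() and now - path.stat().st_atime > MAX_IDLE:
            target = COLD_POOL / path.relative_to(FAST_POOL)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(path), str(target))  # relocate; a real system would leave a stub behind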


Gaining insight

Managing the data is one thing, but gaining meaningful insight from vast stores of structured and unstructured information is an entirely different proposition. Cue the rise of big data analytics. This aspect of big data is of particular interest to the accelerator and processor market. Determining patterns and correlations within data falls under the heading of ‘machine learning’. ‘There has been a lot of talk about big data in the past two years, and the one concrete thing that…
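
As a minimal, hypothetical illustration of the pattern-finding the article groups under ‘machine learning’, the sketch below clusters synthetic data points with k-means (numpy and scikit-learn assumed available); the data and the choice of three clusters are arbitrary.

    # Minimal machine-learning illustration: k-means clustering of synthetic data.
    # Assumes numpy and scikit-learn are installed; data and parameters are arbitrary.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three synthetic groups of 2-D points around different centres.
    data = np.vstack([
        rng.normal(loc=centre, scale=0.5, size=(200, 2))
        for centre in ([0, 0], [5, 5], [0, 5])
    ])

    model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    print(model.cluster_centers_)   # recovered group centres
    print(model.labels_[:10])       # cluster assignment per point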

