deep-sequencing information across thousands of CPUs. 'Systems biology demands massive integration of extremely large datasets. Large shared memory should enable us to handle such data at a much higher speed and with a greater focus on the biological questions at hand,' says Professor Peter Rigby, chief executive of the ICR.
[Image caption: The SGI Altix UV system supports up to 16 Tbytes of global shared memory in a single system]
The do-it-yourself approach
Rather than purchase a preconfigured machine with large memory capability, the Edinburgh Parallel Computing Centre (EPCC) configured its own and had it built by a systems house, Mini-ITX. For now, it is known as the DIR (Data Intensive Research) machine. According to Adam Carter, an applications consultant at EPCC, this machine differs from typical HPC machines or departmental clusters in that it is designed to be more 'Amdahl-balanced' – a term used to describe machines that can perform input/output operations at a rate close to that at which arithmetic operations can be performed. 'Because data I/O can be a bottleneck, the EPCC elected to go with Atom processors to keep costs down, and why choose a faster CPU if it's just sitting there waiting for data? The Atom is slowish, but it's as fast as it needs to be. We'd rather spend our money on disks than on CPUs that are disproportionately powerful for the problems we want to solve.' The DIR has 120 nodes, each with one Atom processor, one Nvidia GPU, 256 Gbytes of solid-state disk and three rotating disks, for a total of 720 Tbytes of fast disk storage. The system doesn't use virtualisation per se; instead, a copy of the software stack runs on each node for processing. The machine should be well suited to scientific cases that either use large amounts of data (several tens of Tbytes) in a single simulation or analysis run, or are constrained by the speed of reading from and writing to disk – a traditional bottleneck for high-performance computers with fast processors.
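To make the 'Amdahl-balanced' idea concrete, a machine's Amdahl number is often taken as the ratio of I/O bandwidth (in bits per second) to instruction rate (instructions per second), with a value near one considered balanced. The short Python sketch below works through that ratio for a DIR-style node and a conventional HPC node; the per-node bandwidth and instruction-rate figures are illustrative assumptions, not EPCC's published specifications.

```python
# Rough Amdahl-balance comparison for two hypothetical node designs.
# All hardware figures below are illustrative assumptions, not measured specs.

def amdahl_number(io_bytes_per_s: float, instructions_per_s: float) -> float:
    """Amdahl number = bits of I/O per second / instructions per second.
    A value near 1 means the node can stream data about as fast as it computes."""
    return (io_bytes_per_s * 8) / instructions_per_s

# DIR-style node: slow Atom CPU, but an SSD plus three local rotating disks.
atom_node = amdahl_number(
    io_bytes_per_s=500e6,        # ~500 MB/s combined local disk bandwidth (assumed)
    instructions_per_s=2e9,      # ~2 GIPS for a low-power Atom core (assumed)
)

# Conventional HPC node: fast multicore CPU, I/O over a shared filesystem.
hpc_node = amdahl_number(
    io_bytes_per_s=1e9,          # ~1 GB/s share of a parallel filesystem (assumed)
    instructions_per_s=100e9,    # ~100 GIPS across many fast cores (assumed)
)

print(f"DIR-style node Amdahl number: {atom_node:.2f}")    # ~2.00, roughly balanced
print(f"Conventional node Amdahl number: {hpc_node:.2f}")  # ~0.08, compute-heavy
```

On these assumed figures the slow-CPU node is far closer to balance, which is the design point Carter describes: spend on disks rather than on processor speed that would sit idle waiting for data.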
Software instead of large memory
Large flat memory space isn't the only way to attack the big data problem, argues Sumit Gupta, the Tesla product manager at Nvidia, who adds that engineers can apply software techniques to speed the process of evaluating very large amounts of data. He cites the Weather Research and Forecasting Model and its use at the National Center for Atmospheric Research (NCAR), where weather models are moving from terascale (one trillion flops) to petascale-class applications and are reaching a tipping point where adding more CPUs is no longer effective for improving speed. At NCAR, porting the Microphysics routine to Nvidia Cuda code brought a 10-times improvement in its performance – especially significant given that, while Microphysics makes up only one per cent of the model's source code, converting it to Cuda resulted in a 20 per cent speed improvement for the overall model.
Another beneficiary of GPUs for large data problems is Geomage, a company that performs seismic surveys to locate reservoirs of carbon-based fuels. A large 3D survey across 1,600 square kilometres can result in anywhere from one to 50 Tbytes of data. Here, too, most of the work is performed in a small portion of the code, which, when run on a GPU, results in performance increases in the range of 60 to 80 times.
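The NCAR and Geomage figures are, in effect, Amdahl's law at work: the overall gain from accelerating one routine is capped by the share of runtime that routine represents. A minimal sketch of that arithmetic in Python, using the numbers quoted above; the runtime fractions it derives are inferences from those numbers, not figures reported by NCAR or Geomage.

```python
# Amdahl's law: overall speedup when a fraction p of the runtime is
# accelerated by a factor s and the remaining (1 - p) is unchanged.

def overall_speedup(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

def runtime_fraction(overall: float, s: float) -> float:
    """Back-solve Amdahl's law for p, given the observed overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)

# NCAR: Microphysics runs 10x faster on the GPU and the whole model improves
# by 20 per cent (overall speedup of 1.2). That implies the routine accounts
# for roughly 19 per cent of runtime, even though it is only one per cent of
# the source code.
print(runtime_fraction(overall=1.2, s=10))   # ~0.185

# Geomage: a 60-80x overall gain is only possible if the accelerated portion
# dominates runtime; even with an infinitely fast kernel, p = 0.985 caps the
# overall speedup near 67x.
print(overall_speedup(p=0.985, s=1e9))       # ~66.7
```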
Farming the data out
When it comes to working with big data, some people don't focus exclusively on individual computer systems. 'Data must be put into the hands of experts, and it's quite normal that many partners around the world are involved,' comments Don Petravick, senior project manager for the Dark Energy Survey at the National Center for Supercomputing Applications (NCSA). Rather than build its own dedicated research network, NCSA uses TeraGrid, which describes itself as 'an open scientific discovery infrastructure combining leadership class resources at 11 partner sites to create an integrated, persistent computational resource.' Linked by high-performance network connections, TeraGrid resources include more than 2.5 Pflops of computing capability and more than 50 Pbytes of online and archival data storage; researchers can also access more than 100 discipline-specific databases. With this combination of resources, TeraGrid is claimed to be the world's largest, most comprehensive distributed cyberinfrastructure for open scientific research.
Taking advantage of this framework, the Dark Energy Survey is intended to probe the origin of the accelerating universe and help uncover the nature of dark energy by measuring the 14-billion-year history of cosmic expansion with high precision. More than 120 scientists from 23 institutions in the United States, Brazil, Spain, Germany and the UK are working on the project. The collaboration is building an extremely sensitive 570-megapixel digital camera high in the Chilean Andes. Every time this telescope points at the sky, it examines an area the size of the moon; during the five years of the survey, the result will be 100 Tbytes of data. 'We collect the data, make a safe copy and then farm out the data to the TeraGrid,' explains Petravick, who concludes that 'big national/international infrastructures such as this are essential to accomplishing the tasks we've set out for ourselves.'