SCW_JUNJUL12

Each EnergyCore chip on Calxeda’s Quad-Node Reference Design implements a 5 W server.

HPC users, they showed significant interest. Financial applications, for instance, don’t involve large parallel jobs, as is the case in much of science, but rather lots of small jobs that each perform millions of small calculations. This first effort is not intended for big parallel MPI jobs and Top500-type systems, but it is ideal for embarrassingly parallel workloads such as genetics, or for security agencies looking for specific words in text strings. There will eventually be a mix of options concerning the CPU and the accelerator, and users will have to examine their applications closely when making the choice. There will be purpose-built servers for different workloads.’

FOR INTEL, THE WORLD IS AN X86 PROBLEM AND EVERY PROBLEM HAS AN X86 SOLUTION; FOR NVIDIA, EVERY PROBLEM HAS A GPU SOLUTION

Calxeda has attracted the attention of other companies active in the server space, and it now has business agreements with the hardware manufacturer and systems integrator Boston, along with companies supplying support software such as MapR Technologies and uCirrus, plus the data-storage company ScaleIO. More specifically, the Boston Viridis Project packs 48 nodes into a 2U enclosure, so this platform can provide 900 servers in a standard 42U rack and deliver up to 10 times the performance per Watt of existing processor technologies.

38 SCIENTIFIC COMPUTING WORLD

While the ARM is attractive on the power front, what we really need for HPC, says Nvidia’s Gupta, is a 64-bit version and ECC memory protection. ARM has announced it is working on a 64-bit version based on the ARMv8 architecture and expects volume production by 2014. But there are already systems combining ARM processors and GPUs, such as those built on chips from Nvidia, whose Tegra 3 incorporates a quad-core ARM Cortex-A9 CPU with a GeForce GPU. This device is currently used in mobile phones and tablet computers. To aid developers, Nvidia also offers the CARMA (Cuda on ARM) development kit.

It appears that in the future we can certainly expect GPUs working as accelerators for ARM processors in HPC systems. We’re getting a glimpse of what such a supercomputer could look like thanks to a project at the Barcelona Supercomputing Center, which is developing a system based on the Tegra 3. This system is now entering the prototype-building stage, and the operators hope to start running benchmarks this summer. It will be interesting to see what this machine accomplishes once complete.
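The embarrassingly parallel workloads mentioned earlier, such as scanning text for specific words, can be illustrated with a short Python sketch. This is not Calxeda’s or any vendor’s software: the function names `count_words` and `search` are hypothetical, and a pool of workers on one machine stands in, by assumption, for the many small, low-power server nodes of a system like the Boston Viridis. Each chunk of text is searched independently, so workers never communicate; the only coordination is summing the per-chunk counts at the end.

```python
from collections import Counter
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from itertools import repeat


def count_words(chunk: str, targets: frozenset) -> Counter:
    """Count occurrences of the (lower-case) target words in one chunk of text.

    The function touches no shared state, which is what makes the overall
    job embarrassingly parallel: any worker can process any chunk.
    """
    hits = Counter()
    for token in chunk.lower().split():
        word = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if word in targets:
            hits[word] += 1
    return hits


def search(chunks: list, targets: list, executor: Executor) -> Counter:
    """Map count_words over independent chunks, then reduce the partial counts.

    The executor is a parameter (a hypothetical design choice for this sketch)
    so the same code can run on threads, on local processes, or, by extension,
    on a cluster-backed executor.
    """
    target_set = frozenset(w.lower() for w in targets)
    total = Counter()
    # Each task is independent: no messages pass between workers.
    for partial in executor.map(count_words, chunks, repeat(target_set)):
        total += partial  # cheap final reduction step
    return total
```

To spread the chunks across CPU cores, pass a `ProcessPoolExecutor` (created under an `if __name__ == "__main__":` guard on platforms that spawn worker processes). A `ThreadPoolExecutor` runs the same code, but because of CPython’s global interpreter lock, pure-Python counting like this will not speed up on threads.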

Making petaflops affordable

A device that got the scientific community quite excited about GPU computing was Nvidia’s Fermi chip, with its ECC and double-precision processing. Now the company is introducing the Kepler chip, which Gupta calls ‘a revolution that will change the face of HPC and put petaflop power in the budget of mid-sized enterprises.’

Available now on the Tesla K10 board, the K10 chip is a single-precision accelerator that offers three times the performance per Watt of the Fermi. This comes primarily from the redesign of the basic building block, the SM (streaming multiprocessor): each Fermi SM contains 32 cores, whereas each redesigned Kepler SM contains 192. Gupta says this means you can get 1 petaflop in 10 racks at 400 kW, whereas with an Intel Sandy Bridge chip you need 100 racks and 2 to 3 MW.

The most exciting news for the HPC community, though, will come later this year with the K20 chip. Not only does it have all the features of the K10, but it also performs double-precision math and offers several new architectural features not available in the K10. The first of them is Hyper-Q. In the Fermi, only one MPI task can run on the GPU at a time, whereas the Kepler chip can run 32 simultaneous MPI tasks. This ability greatly increases GPU utilisation and cuts CPU idle time. Next is dynamic parallelism. With the Fermi, the CPU sends jobs to the GPU; with the Kepler, the GPU adapts to the data and dynamically launches new threads on its own. The CPU is no longer in charge of all actions, which makes the GPU more autonomous and also makes programming GPUs easier. ‘Because of dynamic parallelism, Kepler can now accelerate almost any application,’ claims Gupta.

On the software front, Nvidia has contributed the Cuda compiler to the LLVM open source project. This means that programmers and developers can build front-ends for Java, Python, R and domain-specific languages, and they can target other

www.scientific-computing.com
