SCW_FEBMAR13

The processing challenge

In the battle for processor dominance, who will win out? John Barr weighs up the contenders

Mainstream processors are designed to be general-purpose all-rounders. The number-crunching needs of scientists and engineers have often gone beyond what can be delivered by standard systems. Over the years, many special-purpose platforms have been used to provide the compute cycles required. Some of these have been entire systems, such as Cray parallel vector supercomputers that delivered very high performance at a very high price. An alternative solution has been to boost the compute capability of a standard system by adding a coprocessor or accelerator.

Term          Number of floating point operations per second
Kiloflop/s    10^3
Megaflop/s    10^6
Gigaflop/s    10^9
Teraflop/s    10^12
Petaflop/s    10^15
Exaflop/s     10^18

In the 1980s, array processors were the size of a fridge, cost tens of thousands of dollars and delivered a mighty 12 megaflops. Array processors provided a low-cost route to boost the floating point performance of the minicomputers of the day. Along the way, other devices have found limited traction, including Intel’s i860 processor, Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs). This philosophy holds true today, but the format, performance and cost have radically changed. Today’s devices are on PCIe cards the size of a book, offer a peak performance in excess of one teraflop, and cost only a few thousand dollars. By comparison, the first system to deliver one teraflop was built by Intel’s now-defunct Supercomputing Systems Division in 1996, cost $55 million and filled 76 cabinets with 9,072 Pentium Pro processors.

The problem today is further complicated by the high power consumption of electrical components. The peak performance of the fastest supercomputer is expected to advance from one petaflop to one exaflop during this decade, but the power consumed by the system must be constrained if system operation is to be affordable. An improvement of around a factor of 100 in compute power delivered per watt consumed is required if exascale systems are to be feasible. One approach to delivering more compute power per watt is to use a very large number of relatively low-performance, low-power processors that can deliver better aggregate performance per watt than a small number of high-performance, high-power processors.

The most widely used compute accelerator today is Nvidia’s family of GPGPUs (General-Purpose Graphics Processing Units). The company has recently launched a new family of GPUs, the Tesla K20 and the high-end K20X. Intel has also joined the battle with the launch of its Xeon Phi family. Though there are other options, the vast majority of compute accelerators sold during 2013 will include Nvidia K20/K20X or Intel Xeon Phi components.

The market opportunity for K20 and Phi is more than just high-end supercomputers, also covering departmental systems and HPC workstations. The drive for their adoption is the need for more compute performance while consuming less power. The barrier to much wider adoption is software – both the complexity of programming these devices, and the lack of a broad portfolio of applications.

At the very high end there are more people with the right skills, and people willing to put up with programming pain – while in the mid-range and on the desktop, people just want to get their job done. They don’t care how many cores a system has, or what the underlying architecture is; they just want it to work – and fast.

The big fight in 2013

In the blue corner, weighing in at 1.011 teraflops and boasting 60 Pentium cores with a 512-bit wide SIMD unit, is the Intel Xeon Phi 5110P, whose father, Xeon, powers many of today’s supercomputers. In the green corner, weighing in at 1.31 teraflops and powered by 2,688 single precision and 896 double precision cores, is Nvidia’s Tesla K20X, the next generation of the most popular accelerator used in supercomputers today.

The table below shows the technical details, but does not, perhaps, tell the whole story, which will be explored in eight gruelling rounds.

                                     Intel Xeon Phi 5110P    Nvidia Tesla K20X
Peak double precision (Teraflop/s)   1.011                   1.31
Peak single precision (Teraflop/s)   2.022                   3.95
Clock speed (GHz)                    1.053                   0.732
SP cores                             60                      2,688
SP results per clock per core        32                      2
DP cores                             60                      896
DP results per clock per core        16                      2
Memory bandwidth (GB/s)              320                     250
Memory size (GB)                     8                       6
Power consumption (Watts)            225                     235
DGEMM performance (Teraflop/s)       0.877                   1.22
SGEMM performance (Teraflop/s)       1.796                   2.9
STREAM Triad (GB/s)                  171                     176
Price                                $2,649                  $4,000 - $4,500
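As a rough cross-check on the table, the quoted peak figures can be reproduced from the core counts, results per clock per core and clock speeds. The Python sketch below does that arithmetic, and also derives two figures the table does not list directly: DGEMM performance as a fraction of peak, and peak double precision gigaflops per watt. The class and method names are illustrative assumptions, not part of any Intel or Nvidia tool; the numbers are those given in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Figures copied from the comparison table; names here are illustrative."""
    name: str
    clock_ghz: float
    sp_cores: int
    sp_results_per_clock: int
    dp_cores: int
    dp_results_per_clock: int
    dgemm_tflops: float
    power_watts: float

    def peak_sp_tflops(self) -> float:
        # cores x results per clock per core x clock (GHz) gives gigaflop/s;
        # dividing by 1000 converts to teraflop/s.
        return self.sp_cores * self.sp_results_per_clock * self.clock_ghz / 1000

    def peak_dp_tflops(self) -> float:
        return self.dp_cores * self.dp_results_per_clock * self.clock_ghz / 1000

    def dgemm_efficiency(self) -> float:
        # Fraction of the computed double precision peak reached on DGEMM.
        return self.dgemm_tflops / self.peak_dp_tflops()

    def peak_dp_gflops_per_watt(self) -> float:
        # Computed peak double precision gigaflop/s per watt of board power.
        return self.peak_dp_tflops() * 1000 / self.power_watts


PHI = Accelerator("Intel Xeon Phi 5110P", 1.053, 60, 32, 60, 16, 0.877, 225)
K20X = Accelerator("Nvidia Tesla K20X", 0.732, 2688, 2, 896, 2, 1.22, 235)

for card in (PHI, K20X):
    print(
        f"{card.name}: "
        f"DP peak {card.peak_dp_tflops():.3f} TF, "
        f"SP peak {card.peak_sp_tflops():.3f} TF, "
        f"DGEMM efficiency {card.dgemm_efficiency():.0%}, "
        f"{card.peak_dp_gflops_per_watt():.2f} GF/W"
    )
```

The computed peaks agree with the table to within rounding. The one visible gap is the K20X single precision figure, which works out at about 3.94 teraflops against the quoted 3.95, presumably because the clock speed in the table is itself rounded.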

www.scientific-computing.com

