Note that the Intel Xeon Phi performance figures reported here are for the pre-production SE10P, which uses 61 cores instead of 60 and a slightly higher clock speed than the production Xeon Phi 5110P, so the equivalent figures for the 5110P will be marginally lower.


Round one: peak double precision (DP) performance

If you look at the specifications and do the maths (peak performance = clock speed x number of cores x results per clock), you see that these devices have similar peak performance, but achieve it in very different ways. The K20X edges this round, winning it 10-9.

          clock speed (GHz)   number of cores   results per clock   peak performance
5110P     1.053               60                16                  1.011 Teraflop/s
K20X      0.732               896               2                   1.312 Teraflop/s


Round two: peak single precision (SP) performance

SP performance is also important, as not all calculations require the accuracy that DP offers – and both of these devices offer higher SP performance than DP. Another benefit of using SP is that the size of the data is halved, so the effective memory capacity and memory bandwidth are doubled. When a compute problem has been optimised it often becomes a data management problem – and if SP is used, you have only half as much data to manage. Many application areas can use SP calculations, at least partially, including bioinformatics, electronic design automation (except for SPICE – Simulation Program with Integrated Circuit Emphasis), seismic analysis, defence and weather forecasting. Many supercomputer applications today use mixed precision to derive the benefits of SP where possible, but also to maintain the accuracy of their results. The K20X wins this round comfortably, but without a knock-down, so once again the score is 10-9.

          clock speed (GHz)   number of cores   results per clock   peak performance
5110P     1.053               60                32                  2.022 Teraflop/s
K20X      0.732               2688              2                   3.935 Teraflop/s

Round three: memory and bandwidth

          memory bandwidth (GB/s)   memory size (GB)
5110P     320                       8
K20X      250                       6

As was noted above, data management is crucial to supercomputer applications, so the greater memory size and memory bandwidth of the 5110P edge this round for Intel, 10-9.

Summary of rounds one to three

Great theoretical performance is all very well, but it needs to be translated into real performance, and many applications rely on tuned mathematical libraries. DGEMM and SGEMM are the double and single precision versions of the general matrix multiply function. The STREAM benchmark is a synthetic benchmark that measures sustainable memory bandwidth and the corresponding computation rate for simple vector kernels.

          DGEMM (Teraflop/s)   SGEMM (Teraflop/s)   STREAM Triad (GB/s)
5110P     0.877                1.796                171
K20X      1.22                 2.9                  176

It is no surprise that the K20X wins on the matrix multiply tests, but the greater memory bandwidth of the 5110P failed to deliver a winning STREAM score – so perhaps Intel is fortunate to be only marginally behind after the first three rounds.

Round four: power consumption

The power consumption of these two devices is close (Intel 225 watts, Nvidia 235 watts), so this round is a draw, 10-10. Indeed, it is likely that the design targets of both devices were to deliver one teraflop of peak performance inside a 250 watt power envelope.

Round five: price

The RRP of the 5110P is $2,649, while Nvidia says the price for the K20 family is up to its channel, but it expects it to be in the range $4,000 to $4,500, so Intel wins this round 10-9. However, it is not as simple as that. Nvidia's price was set in a market where it had little effective competition, and there is a mass of Cuda software already developed that runs on the device. As the incumbent vendor with a mature product, Nvidia is able to charge a premium. The company also leverages its GPU technology in commodity graphics cards, so it has the volume to support more competitive prices if its position in HPC is threatened by the Xeon Phi. It is anticipated that Nvidia will drop its prices as the Xeon Phi grows market share. It is also worth noting that the price of high-end supercomputers is often very flexible. Many sales that achieve a prominent place in the Top500 or Green500 lists are seen as being strategic by the vendors, so they will bend over backwards to win them.


Round six: programming approaches

This is perhaps the most important round – there is no point in having a very powerful computer if no programs are available for it – but it is also the most complex one to call. Nvidia GPUs are generally programmed


using Cuda, which provides extensions to the standard C, C++ and Fortran languages to support the programming of an accelerator that exploits a high degree of parallelism. Cuda was introduced in 2006 and is attracting 2,000 downloads a day, so it is a pervasive but proprietary approach, although it could be claimed that Cuda is a de facto standard. Intel has a strong family of software


development tools used to program its mainstream Xeon product line. These tools can also be used to quickly build applications for the Xeon Phi. The big question is how efficiently most applications will run on the Xeon Phi without being re-architected for an accelerator model. The answer is that a few applications run very efficiently with relatively little optimisation work, but efficient execution for most applications on the Phi requires significant effort – a similar amount of effort to that required to achieve efficient execution on a GPU. If retargeting applications at the Xeon Phi using Intel's tools were easy, there would be a large portfolio of Phi-optimised applications out there – but this is not the case, which suggests that it is not quite as easy as Intel has been suggesting. So the key issues for this round appear to be


the de facto standard of Cuda for Nvidia, and the widely used Intel tools that support the OpenMP standard and have been retargeted for the Phi. Listening to experienced HPC industry professionals, it appears to be almost a religious debate rather than a reasoned technical discussion. Time will tell. Intel makes a big thing of describing the


Xeon Phi as a coprocessor, not an accelerator. But both the K20X and the Phi feature as PCIe cards in systems that use standard x86 processors, and most developers will take existing applications and offload the computationally intensive parts to the accelerator/coprocessor, so it may be a moot point. The Xeon Phi has three modes of operation


– only two of which are available on Nvidia GPUs. The common approach is offload.


FEBRUARY/MARCH 2013

