GPU PROCESSING
A supercomputer chip for every man

While GPUs (graphics processing units) were initially designed to accelerate video and gaming, vendors of science and engineering software are using them to accelerate their code. Paul Schreier looks at the benefits for today and tomorrow.

Although we're excited about the power of dual- and quad-core CPUs, consider the potential of an integrated circuit with many hundreds of cores – and these are already in your desktop or laptop PC in the form of a GPU, intended to accelerate graphics processing. In addition, board-level GPU products are becoming available, as are preconfigured 'personal supercomputers'. These typically use GPUs from Nvidia and AMD, while Intel has announced its intention to join in the action in a different way. The biggest obstacle for the average end user has been getting software that runs on such devices, but that situation is changing. Leading ISVs (independent software vendors) have either modified their software to run on GPUs or will have such software shortly. The benefits can be astounding: depending on the actual algorithms being run, a GPU can increase performance by an average of 10x and sometimes by a factor of 100x or more.

[Figure: Comparison of LU decomposition, a common matrix operation, performed with and without GPU acceleration (courtesy of the MAGMA project at the University of Tennessee, Knoxville, and the University of California, Berkeley).]

Fat vs lean processors

To understand how these gains are possible, you must understand the architectural differences between the two classes of processors. Sumit Gupta, senior product manager for Nvidia's Tesla GPU Computing Group, explains that traditional CPUs are 'fat' in that they have large caches and a rich instruction set, and thus are suitable for unpredictable tasks. In contrast, GPUs have hundreds of 'lean' processors with reduced instruction sets and lots of small distributed memories, and are designed for predictable, compute-intensive tasks where the algorithms remain the same and only the inputs change. The gains are achieved by taking an algorithm that requires massively parallel computations and splitting them up among the GPU's cores. And while this power is limited to only certain types of applications, these are the ones that scientists and engineers frequently use.

As with any technology, there are downsides. One limitation of GPUs is the requirement for a high level of parallelism in the application; another is power consumption: GPU boards often consume much more than 100W. GPUs also place greater constraints on programmers than do CPUs. For instance, to avoid significant performance degradation it is necessary to avoid conditionals inside kernels. Finally, GPUs suffer from latency in CPU-GPU communication. Unless the amount of processing done on the GPU is great enough, it might be faster to simply perform the calculations on the CPU, but GPU developers are working to make communications and computations overlap without much effort by the programmer.

[Figure: The Tesla T10 chip from Nvidia provides 240 processors with double-precision floating-point capabilities in the GPU.]

Boards for different target markets

Not only are GPUs supplied with many PCs; others interested in adding this capability can purchase board-level and box products with GPUs. Nvidia offers several families of products, but they are all based on the same basic chip, the T10-Series GPU. It features 240 cores for 1 Tflops of single-precision (SP) or 78 Gflops of IEEE-compliant double-precision (DP) performance – an important addition with this chip, because DP is vital in many areas of scientific computing. Its products are branded differently depending on the target market: GeForce chips and boards are found in consumer products such as notebook and desktop PCs for gaming and video processing; the Quadro family is targeted at professional visualisation and graphics; finally, and most recent, are the Tesla boards, intended for a cluster or workstation for accelerating numerical tasks. As noted, all the families use the same 10-Series GPU, but board configurations differ in their amount of memory. Tesla boards, for instance, have no connectors for displays and work only in combination with a CPU to handle parallel tasks.

Tesla numeric accelerator products come in two forms: the Tesla S1070 is a 1U server system with four T10 GPUs, while the Tesla C1060 is a dual-slot PCI Express 2.0 board with one T10 GPU. Further, more than a dozen companies are integrating these board-level products into 'personal supercomputers'.
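To make the idea of splitting a massively parallel computation among the GPU's cores concrete, here is a minimal sketch in Nvidia's CUDA, the programming model used with the Tesla boards discussed above. The kernel (a SAXPY, y = a*x + y), array sizes and launch parameters are invented for illustration and are not taken from the article:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per array element: this is the "split among the
// GPU's cores" the article describes.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the ragged final block
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = new float[n], *h_y = new float[n];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, bytes);
    cudaMalloc((void**)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // threads per block
    int blocks  = (n + threads - 1) / threads;  // enough blocks to cover n
    saxpy<<<blocks, threads>>>(n, 3.0f, d_x, d_y);

    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);              // 3*1 + 2 = 5
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```

Each of the million elements is handled by its own lightweight thread; the hardware schedules those threads across the chip's 240 cores.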
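The warning about conditionals inside kernels refers to branch divergence: threads execute in groups (warps of 32 on this hardware), and if threads within a group take different branches, the group runs both paths in turn. A sketch of the problem and one common workaround, with invented kernel names:

```cuda
// In 'divergent', neighbouring threads in a warp can disagree at the
// branch, so the warp serialises and executes both paths.
__global__ void divergent(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (v[i] < 0.0f)
        v[i] = -v[i];          // path A
    else
        v[i] = 2.0f * v[i];    // path B
}

// 'uniform' computes both candidates and selects one, which the
// compiler can lower to a predicated select with no divergent branch.
__global__ void uniform(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = v[i];
    v[i] = (x < 0.0f) ? -x : 2.0f * x;
}
```

The bounds check (`if (i >= n) return;`) is harmless, since only the last warp diverges on it; it is data-dependent branching across neighbouring threads that hurts.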
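The overlap of communications and computations mentioned above is exposed in CUDA through streams: work queued in different streams may run concurrently, so one chunk's transfer can hide behind another chunk's compute. A hedged sketch (the kernel, chunking and stream count are invented; asynchronous copies also require page-locked host memory, allocated here with cudaHostAlloc):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-element kernel standing in for real work.
__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 1 << 20, nstreams = 4, chunk = n / nstreams;
    float *h_in, *h_out, *d_in, *d_out;
    // Pinned host buffers, so cudaMemcpyAsync is truly asynchronous.
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    cudaStream_t s[nstreams];
    for (int k = 0; k < nstreams; ++k) cudaStreamCreate(&s[k]);

    // Queue copy-in, compute, copy-out per chunk; hardware with a copy
    // engine can overlap stream k's compute with stream k+1's transfers.
    for (int k = 0; k < nstreams; ++k) {
        int off = k * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(chunk + 255) / 256, 256, 0, s[k]>>>
               (d_in + off, d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h_out[0] = %f\n", h_out[0]);

    for (int k = 0; k < nstreams; ++k) cudaStreamDestroy(s[k]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Whether the overlap actually occurs depends on the board's copy engines and driver, which is why the article frames this as ongoing work by GPU developers rather than a guarantee.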
SCIENTIFIC COMPUTING WORLD february/march 2009 www.scientific-computing.com