HPC PROJECTS: SOFTWARE FOR GPU PROCESSING


with tools such as CUDA or OpenCL implementations, in order to preserve developers' legacy investment.




syntax in the language for launching a kernel function for execution on the GPU.
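In CUDA, that syntax is the triple angle bracket launch, a small extension to C. A minimal sketch (assuming the standard CUDA runtime API; error checking omitted for brevity):

    #include <cuda_runtime.h>

    // A trivial kernel: each thread scales one element of the array.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // The <<<blocks, threads>>> triple-chevron syntax is the CUDA
        // language extension for launching a kernel on the GPU.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }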


‘While GPUs belong to a new class of computation devices that offer upwards of 10x improvement in FLOPS/watt… they can be 100x more difficult to program,’ says Reservoir Labs, which has developed R-Stream. This is a source-to-source compiler that takes a sequential routine as input and produces code that has been parallelised and optimised for these new types of processor; it can emit optimised code in a variety of formats for downstream toolchains, for example OpenMP and CUDA. To achieve the mapping, R-Stream performs advanced transformations on the sequential C. These include special forms of array expansion (to remove constraints on parallelism), joint scheduling for parallelism and locality, task-granularity selection, communications/DMA generation, software pipelining, memory-region reshaping and back-end dialect generation. The resulting mapped program is much more than simply parallelised – it represents a detailed choreography of computation and data motion across parallel units and through explicitly managed memory hierarchies.
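To make the idea concrete, here is a hypothetical illustration of that kind of mapping – a hand-written sketch of what a source-to-source mapper might produce, not R-Stream's actual output. The sequential loop becomes a kernel, the iteration space is spread across thread blocks, and the generated data motion (the ‘DMA’) becomes explicit host-side transfers:

    /* Input to the mapper: a plain sequential C loop nest. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* The kind of CUDA output such a mapper might emit (hypothetical). */
    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    void saxpy_mapped(int n, float a, const float *x, float *y)
    {
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice); /* generated transfer in */
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, dx, dy);
        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost); /* generated transfer out */
        cudaFree(dx);
        cudaFree(dy);
    }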


Similarly, HMPP from CAPS includes code generators designed to extract data parallelism from C and Fortran kernels and translate them into CUDA or OpenCL. HMPP Workbench includes a C and Fortran compiler, back-end code generators, and a runtime that integrates seamlessly with




CUDA/OpenCL tools and drivers. Further, the runtime ensures application deployment on multi-GPU systems. The firm most recently announced support for Microsoft Windows HPC Server 2008 R2 and Visual Studio 2008 to help port applications onto GPU-based architectures.
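HMPP's directives centre on ‘codelets’: pure C or Fortran functions that the code generators translate for the chosen target and that the runtime offloads at a marked call site. The sketch below follows the documented codelet/callsite directive style, though exact clause spellings varied across HMPP versions, so treat them as illustrative:

    #include <stdio.h>

    /* Declare a codelet: a pure C function that HMPP's back-end code
     * generator translates into a CUDA kernel.  The io clause tells the
     * runtime which arguments move to and from the device. */
    #pragma hmpp vecadd codelet, target=CUDA, args[c].io=out
    static void vecadd(int n, float a[n], float b[n], float c[n])
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        enum { N = 4096 };
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* The callsite directive makes the HMPP runtime offload this
         * call, generating data transfers behind the scenes.  An
         * ordinary C compiler ignores the pragmas, so the same source
         * still builds and runs on a plain CPU. */
        #pragma hmpp vecadd callsite
        vecadd(N, a, b, c);

        printf("c[1] = %f\n", c[1]);
        return 0;
    }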


Another programming tool to watch for comes from Mercury Computer Systems, which recently introduced the first release of OpenSAL, an open-source version of its Scientific Algorithm Library (SAL) for vector math acceleration. Accelerated versions of OpenSAL exist for PowerPC and Intel processors, and GPU support is planned for an upcoming release. SAL preserves the traditional programming model while shielding users from GPU language extensions, enabling the same SAL program to run on the desktop using CPUs and automatically ‘just work’ when run on a system equipped with the latest GPU processors. SAL also supports an integrated environment
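The portability claim rests on a classic design: a fixed library interface whose implementation is swapped underneath. The sketch below is illustrative only – my_sal_vadd and gpu_vadd are hypothetical names, not the actual OpenSAL API – but it shows why a caller's source need not change when the back end moves to a GPU:

    #include <stdio.h>

    /* Hypothetical GPU implementation, supplied by an accelerated
     * build of the library (declaration only here). */
    void gpu_vadd(const float *a, int as, const float *b, int bs,
                  float *c, int cs, int n);

    /* A hypothetical SAL-style strided vector add.  Application code
     * calls this one portable entry point; which back end executes it
     * is the library's business, not the caller's. */
    void my_sal_vadd(const float *a, int as, const float *b, int bs,
                     float *c, int cs, int n)
    {
    #ifdef USE_GPU_BACKEND
        gpu_vadd(a, as, b, bs, c, cs, n);   /* accelerated build */
    #else
        for (int i = 0; i < n; i++)         /* reference CPU build */
            c[i * cs] = a[i * as] + b[i * bs];
    #endif
    }

    int main(void)
    {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
        my_sal_vadd(a, 1, b, 1, c, 1, 4);   /* same call either way */
        printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }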


GPU cluster tools

Besides applications, development is also taking place on cluster software for GPUs. Several commercial cluster-management tools for GPU-based systems are available from firms such as Platform Computing, ClusterCorp and Bright Computing. Recently, some free tools – specifically a CUDA wrapper library and a CUDA memory tester – have become available, based on work taking place at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign.


First, though, consider the development platform: the Accelerator Cluster (AC) at NCSA's Institute for Advanced Computing. This experimental cluster has been used extensively as a test bed for developing and evaluating the GPU cluster-specific tools that the GPU team feels are missing from Nvidia and the cluster-software providers. With 40 nodes, it combines GPU and FPGA technology to explore the potential of these novel architectures to accelerate scientific computing. In the AC, 32 nodes each feature two dual-core AMD Opterons, an Nvidia Tesla S1070 with four GT200 GPUs, and a Nallatech H101-PCIX FPGA accelerator; the remaining eight nodes each feature two six-core Istanbul CPUs and three AMD/ATI 5870 GPUs.


Members of NCSA's Innovative Systems Laboratory standing in front of the GPU-enabled Accelerator Cluster




According to the team, much batch-system software today does not prevent users from trying to access the same GPU when sharing the same node. To achieve a truly shared multi-user environment, the NCSA GPU team wrote a library, called CUDA wrapper, which overrides certain CUDA device-management calls to ensure that users see, and have access to, only the GPUs allocated to them by the batch system.
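How might such an override work? A minimal sketch of the interposition technique follows – it is not NCSA's actual implementation, and the CUDA_ALLOWED_DEVICES variable and 16-device limit are assumptions for illustration. The shim is built as a shared library and injected into user jobs, where it remaps the virtual device numbers a job sees onto the physical GPUs the batch system granted it:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdlib.h>
    #include <string.h>
    #include <cuda_runtime.h>

    static int allowed[16];
    static int num_allowed = -1;

    /* Parse the (hypothetical) CUDA_ALLOWED_DEVICES variable set by the
     * batch system, e.g. "2,3" exposes physical GPUs 2 and 3 to the job
     * as virtual devices 0 and 1. */
    static void load_allocation(void)
    {
        const char *s = getenv("CUDA_ALLOWED_DEVICES");
        num_allowed = 0;
        while (s && *s && num_allowed < 16) {
            allowed[num_allowed++] = atoi(s);
            s = strchr(s, ',');
            if (s) s++;
        }
    }

    /* Users only ever see the GPUs granted to them... */
    cudaError_t cudaGetDeviceCount(int *count)
    {
        if (num_allowed < 0) load_allocation();
        *count = num_allowed;
        return cudaSuccess;
    }

    /* ...and virtual device numbers are remapped to physical ones
     * before being forwarded to the real CUDA runtime. */
    cudaError_t cudaSetDevice(int device)
    {
        if (num_allowed < 0) load_allocation();
        if (device < 0 || device >= num_allowed)
            return cudaErrorInvalidDevice;
        cudaError_t (*real_set)(int) =
            (cudaError_t (*)(int))dlsym(RTLD_NEXT, "cudaSetDevice");
        return real_set(allowed[device]);
    }

Built with something like ‘gcc -shared -fPIC wrapper.c -o libwrap.so -ldl’ and injected via LD_PRELOAD, a job that asks for device 0 is transparently steered to whichever physical GPU the scheduler assigned.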



