heterogeneous computing


The company has overloaded a number of Matlab functions to work on GPUs. It started with the 90 functions that were easiest to implement, explains Jos Martin, who leads all development for the Parallel Computing Toolbox; the team is now moving on to the more difficult functions, and the current count is 150. Martin notes that some of these have taken an enormous amount of work. ‘We’ve learned just how hard it is to write GPU code, and we didn’t realise how difficult it would be even for experienced software engineers. We’ve been writing parallel code for many years, and Cuda takes it to the next level. But now with GPUs we’re dealing with algorithms where literally millions of threads are running simultaneously – and this is a completely different world.’

This overloading approach won’t satisfy everyone, in particular those who want to take a chunk of their own code and run all of it across GPU threads. Here Matlab offers arrayfun(), which applies a function to every element of an array on the GPU; it does so with a just-in-time compiler that compiles a subset of M-code for the GPU. Then there are those who want to run their own Cuda code but have Matlab manage data transfers and memory. Given a Cuda file you want to execute on the GPU, you must first compile it into a PTX file using Nvidia’s tools; you can then execute that PTX code from within Matlab.
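
To make that workflow concrete, the sketch below shows the kind of minimal Cuda kernel that could be compiled to PTX and then launched from Matlab. The file name, kernel name and the choice of a simple element-wise addition are illustrative assumptions rather than anything prescribed by the toolbox; the nvcc command in the comment is the standard Nvidia route to a PTX file.

    // addvec.cu: hypothetical element-wise addition kernel (illustrative only)
    // Compile to PTX with Nvidia's compiler:  nvcc --ptx addvec.cu -o addvec.ptx
    extern "C" __global__ void addvec(const double *a, const double *b,
                                      double *c, int n)
    {
        // One thread handles one array element
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }

Once the PTX file exists, the Parallel Computing Toolbox can wrap it as a kernel object and launch it on GPU-resident data, with Matlab looking after the transfers between host and device memory as described above.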


Martin warns users not to be misled by claims of enormous application acceleration with GPUs. ‘It’s a misnomer that 1000x speedups are possible. The basis of comparison would be very old, poorly written code running on old hardware. For FFTs and convolutions, 30x improved performance can be achieved, but in general a maximum speed-up of 15x is what you can expect. Many times users will have to be content with 3x or even 1.5x, which in many cases is still quite attractive.’

A previous product from AccelerEyes was intended to make GPU functions easier to access from Matlab, and the firm has now brought out Libjacket, a similar library available for C, C++, Fortran and Python. It is designed to be used from any Cuda application in the same way that the native CUBLAS and CUFFT libraries are leveraged, and it enables high-level matrix-style code to achieve low-level, down-to-the-metal speeds. The V1.0 release also incorporates the popular GFOR loop for running FOR-loop iterations simultaneously on a GPU.
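
As a point of contrast, the sketch below shows the sort of hand-written Cuda host code (explicit device allocation, transfers and a CUBLAS call) that such libraries are designed to hide. The vector size, variable names and the choice of a saxpy operation are assumptions made purely for illustration; nothing here is taken from Libjacket’s own API.

    // saxpy_cublas.cu: explicit Cuda/CUBLAS host code for y = alpha*x + y (illustrative only)
    // Build with:  nvcc saxpy_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main()
    {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f);

        // Explicit device allocation and host-to-device copies
        float *d_x = 0, *d_y = 0;
        cudaMalloc((void**)&d_x, n * sizeof(float));
        cudaMalloc((void**)&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        // The actual computation is a single library call
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
        cublasDestroy(handle);

        // Copy the result back and release device memory
        cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }

In a matrix-style library, the same operation would typically be a one-line expression on array objects, with the allocations and copies handled behind the scenes.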


AMD’s multilanguage support

AMD is a strong supporter of OpenCL, and it recently announced a collaboration with MulticoreWare to deliver a set of OpenCL optimisation tools that exploit the processing capability of its Fusion Accelerated Processing Units (APUs). The overarching goal is to reduce drastically the number of code versions required and the time spent on hardware-specific performance optimisation. The MulticoreWare framework supports four languages (OpenCL, Cuda, Pyon and DSL), removing the need for hand-optimised versions for each platform: with the MulticoreWare Tools Framework, only one OpenCL code base is needed to reach good performance on all supported platforms, whether multicore CPUs, many-core CPUs or FPGAs.

The tools consist of the System Dependency Analyzer, which helps identify performance bottlenecks, application critical paths and system-wide dependencies in heterogeneous systems; Slot Maximizer, a compiler transformation tool that automatically tunes OpenCL kernels for better hardware utilisation; Data Layout, an OpenCL source-to-source transformation tool that automates layout adjustments to aggregate data structures when preparing application kernels for heterogeneous computing; Global Memory for Accelerators, a user-level API that gives applications a single address space across CPUs and GPUs, so that code on both types of device can access the same logical data objects; and the Heterogeneous Multicore Task Scheduler, a library for creating task-based multithreaded/multicore applications and implementing dynamic workload balancing across the whole heterogeneous system.


Coming from Microsoft: C++ AMP

When it comes to programming languages and environments, you can expect Microsoft to be involved somewhere. Although the company doesn’t have anything to offer specifically for heterogeneous programming today, it did map out its intentions this past summer at the AMD Fusion Developer Summit. In a keynote speech, Herb Sutter, Microsoft principal architect for native languages, spoke about C++ AMP (Accelerated Massive Parallelism), emphasising that with it one executable will work on any heterogeneous hardware, and that future releases will address multicore and cloud applications. By openly extending the C++ language, he said, C++ AMP will allow programmers to express parallelism in their applications using familiar, modern C++. It is being built to enable high-performance parallel computing on a variety of heterogeneous processors. As the company stated on one of its blogs, ‘By building on the Windows DirectX platform, our implementation of C++ AMP allows you to target hardware from all major hardware vendors. We expect it will be part of the next Visual C++ compiler and fully integrated into the next release of the Visual Studio experience.’

Several bloggers asked questions of a similar nature: why does Microsoft need to create something that looks like Cuda and OpenCL? OpenCL so far seems to be the only language with support from multiple hardware vendors, so why not focus on that? In his keynote, Sutter addressed this point by saying that, first, a language must be mainstream and programmable by millions; second, Microsoft wanted to ‘mess with’ the language as little as possible, and so opted for just one general language extension; third, it should be portable, so you can mix and match hardware from any vendor even with one executable; and fourth, it should look beyond GPUs to cover the full range of hardware heterogeneity, and even migrate to the cloud. In addition, Microsoft is promoting C++ AMP as an open specification.


The ultimate barrier

Are software tools holding us back from adopting GPUs? Certainly to some extent, but the ultimate barrier to GPU programming, according to Nvidia’s Gupta, is inertia. Everything you read makes GPU programming sound very difficult, so many engineers never even give it a try. Yes, there is a learning curve, he admits, but most people are ultimately successful and achieve considerable performance gains. ‘It’s much easier than you think’ has become his favourite slogan.



