HPC PROJECTS: MULTICORE

Fig 1: Intel’s Parallel Amplifier improves parallel performance by reducing the time that threads have to wait on locks



code; and Parallel Advisor Lite identifies areas that can benefit from parallelism.

Intel Threading Building Blocks (TBB) provides an alternative to the OpenMP API, which has traditionally been the leader in the HPC area. OpenMP was originally written for vector supercomputers and for improving the performance of loops but, according to Reinders, it is starting to show its age. TBB, a more flexible scheme, is a runtime-based parallel programming model for C++ code that has been placed in the public domain. It is said to help programmers exploit multicore processors without having to be threading experts, because they specify logical parallelism instead of threads. Breaking a program into separate function blocks and assigning a separate thread to each one is a solution that typically doesn't scale well; TBB instead provides an abstraction layer in which logical sequences of operations are treated as tasks, and the library's runtime engine allocates those tasks to individual cores dynamically.

Another tool from Intel is the Thread Checker, which Reinders labels 'exciting' because it can directly detect possible causes of a deadlock or a data race. Data races are 'horrible' to debug because, like an intermittent fault in a hardware design, they show up only sporadically, and weeks can be spent searching for one. They have been a major barrier to the creation of good parallel code.

Two more tools that should become available this year are the result of acquisitions by Intel. The first is Ct technology, which Intel acquired with RapidMind. Ct is forward-scaling: it lets a single-source application work consistently on multiple multicore and manycore processors with different architectures, instruction sets, cache architectures, core counts, and vector widths, without requiring developers to rewrite programs. Ct is built on the C++ language to provide a simple, portable, data-parallel programming API that results in simpler and more maintainable code. Finally, Ct prevents parallel programming bugs such as data races and deadlocks by design: it prompts developers to specify computations in terms of composable, deterministic patterns close to the mathematical form of their problem, not in terms of low-level parallel computation mechanisms. Ct then automatically maps this high-level, deterministic specification onto an efficient implementation, eliminating the risk of race conditions and non-determinism.
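Ct's own API is not shown in the article, but the pattern-based style it describes can be sketched with standard C++ algorithms (plain STL standing in here for Ct's patterns; the function name is illustrative). A dot product is written as a composition of deterministic map and reduce patterns, close to its mathematical form, with no threads, locks, or shared mutable state for a race to occur on:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Pattern-based style in the spirit of Ct: the computation is stated as a
// composition of deterministic patterns (map, then reduce) close to the
// mathematical form sum_i a_i * b_i. Because there is no explicit
// threading and no shared mutable state, there is nothing to race on.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    std::vector<double> products(a.size());
    // map: elementwise multiply
    std::transform(a.begin(), a.end(), b.begin(), products.begin(),
                   [](double x, double y) { return x * y; });
    // reduce: sum the products
    return std::accumulate(products.begin(), products.end(), 0.0);
}
```

A runtime such as Ct's can map this kind of specification onto however many cores and vector lanes are available without changing its meaning, which is what makes the result deterministic by construction.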

‘Serial debugging and regression testing remain unchanged’

Another upcoming product for node-level parallelism is Cilk++, based on technology acquired from Cilk Arts last year. This extension to C++ is designed to provide a simple, well-structured model that makes development, verification and analysis easy. With it, programmers typically don’t need to restructure programs significantly in order to add parallelism. With the Intel Cilk++ SDK, programmers insert three Cilk++ keywords into a serial C++ application to expose parallelism. The resulting Cilk++ source retains the serial semantics of the original C++ code. Consequently, programmers can still develop their software in the familiar serial domain using their existing tools. Serial debugging and regression testing remain unchanged.
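The three keywords are cilk_spawn, cilk_sync and cilk_for. A sketch of how they expose parallelism (this requires the Intel Cilk++ compiler; a stock C++ compiler will not accept the keywords):

```cpp
// Cilk++ sketch: delete the three cilk_ keywords and this is the
// original serial C++ program, which is why serial debugging and
// regression testing remain unchanged.
long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // callee may run in parallel
    long y = fib(n - 2);             // continuation runs meanwhile
    cilk_sync;                       // wait for the spawn before using x
    return x + y;
}

void scale(double* a, int n, double s) {
    cilk_for (int i = 0; i < n; ++i)  // iterations may run in parallel
        a[i] *= s;
}
```

Because eliding the keywords yields the serial program, the parallel version keeps the serial semantics of the original code, exactly as described above.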

Similar help is coming from other companies. For instance, FASThread is an interactive compiler add-on integrated into an IDE, currently Visual Studio (with an Eclipse version and a command-line version planned). At this point, Visual Studio C/C++ is the only supported compiler, with Intel C/C++ and GNU C/C++ in the works. The tool was developed by Nema Labs, a spin-off from research led by Professor Per Stenstrom at Chalmers University of Technology in Gothenburg, Sweden. Stenstrom states that the long-term vision of the technology is to shield platform-dependent optimisations from software developers so that they can focus on software innovation while FASThread unlocks the performance of present and future multicore platforms, whether homogeneous or heterogeneous.

The capabilities added are twofold. First, FASThread guides the developer in cleaning a program of dependences so that it can be automatically parallelised and tested; this includes, for instance, removing all data dependences that could cause data races at run-time. Second, FASThread parallelises the sequential source code, generating a semantically equivalent parallel version of the program that uses a parallelisation API supported by the compiler for the target system. This parallelised version is fed directly into the target system's C/C++ compiler to produce an optimised, parallelised binary.

In tests on a set of nine applications from various scientific areas, the applications ran two times faster on a quad-core machine after only a 15-minute session with FASThread. For most of them, the developer had to interact with the tool to remove dependences from the code before parallelism could be unlocked. With a total of eight hours' work, it was possible to increase performance two to five times on an eight-core machine.

SCIENTIFIC COMPUTING WORLD JUNE/JULY 2010
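The article does not show FASThread's output, but OpenMP is a typical example of the kind of parallelisation API such a tool can target. As a hand-written illustration of the general before/after shape (the function names are made up), here is a loop whose running total creates a cross-iteration dependence, and the same loop with the dependence expressed as a reduction so it can be safely parallelised:

```cpp
#include <vector>

// Before: the running total 'sum' creates a dependence between
// iterations -- naively running them on several threads would
// produce a data race on 'sum'.
double sum_serial(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    return sum;
}

// After: the dependence is expressed as a reduction, so each thread
// accumulates a private partial sum and the partials are combined at
// the end. Compiled without OpenMP, the pragma is ignored and the
// loop runs serially, preserving the serial semantics.
double sum_parallel(const std::vector<double>& v) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        sum += v[i];
    return sum;
}
```

This is the spirit of the two-step workflow described above: first remove (or restructure) the dependences that would cause races, then emit an annotated, semantically equivalent version for the target compiler.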

A decade of experience

‘When it comes to parallel programming, it’s easy to do something that looks right, but it’s difficult to be sure it is right and will do the same thing under all conditions,’ says Andrew Jones, VP of HPC Business for the Numerical Algorithms Group. That company supplies both the NAG Parallel Library and the NAG Library for SMP and multicore. The latter is far from new – it has been available for a decade – and is said to be the largest commercial numerical algorithm library developed to harness the performance gains from the shared-memory parallelism of Symmetric Multi-Processors (SMP) and multicore processors. It holds more than 1,600 algorithms, many of them specifically tuned to run significantly faster on multisocket and multicore systems. The NAG Parallel Library, in contrast, has been developed to let applications take advantage of distributed-memory parallel computers with ease; its components hide the message-passing details to maximise modularity.

www.scientific-computing.com
