HPC PROJECTS: MULTICORE
‘We strongly urge people to use prepackaged routines such as these where other people have done the difficult work of dividing up the tasks in an optimal way,’ says Jones. He adds that simply by linking in the NAG libraries you can get an immediate 2x or 4x speedup, assuming that the computationally intensive parts of a program use the libraries. In NAG’s support program for HECToR
(the UK’s national supercomputing facility), the company is running optimisation projects on a number of codes. One example is CASTEP, a key materials science code, which was enhanced with band-parallelism so that it scales to more than 1,000 cores. Using NAG technology, the speed of CASTEP on a fixed number of cores was improved fourfold, representing a potential saving of $3m in computing resources over the remainder of the HECToR service.
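The point about prepackaged routines can be illustrated in Python, with NumPy standing in for a tuned numerical library (this is not NAG's API; the matrix size and the hand-rolled baseline are illustrative):

```python
# Sketch: the "use prepackaged routines" advice, with NumPy standing in
# for a vendor-tuned library such as NAG (illustrative only; not NAG's API).
import time
import numpy as np

def naive_dot(a, b):
    """Hand-rolled matrix multiply: correct, but leaves performance on the table."""
    n, k = len(a), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(k)] for i in range(n)]

n = 120
a = np.random.rand(n, n)
b = np.random.rand(n, n)

t0 = time.perf_counter()
naive = naive_dot(a.tolist(), b.tolist())
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
tuned = a @ b            # dispatches to an optimised (often multithreaded) BLAS
t_tuned = time.perf_counter() - t0

print(f"naive: {t_naive:.3f}s  library: {t_tuned:.3f}s")
```

The two results agree, but the library call is orders of magnitude faster, which is the experts-did-the-hard-work effect Jones describes.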
Implicit parallelism
Text-based languages express parallel code using special notation that creates parallel tasks, but managing these multithreaded applications can be a challenge. The situation is quite different with LabVIEW, the graphical programming language from National Instruments. Applications are developed as if drawing a block diagram on paper, and because LabVIEW follows a dataflow model, any time a wire branches (a parallel sequence on the block diagram), the LabVIEW execution engine tries to run the branches in parallel. Automatic multithreading has been natively supported in LabVIEW since 1998, and later versions have refined the process.
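LabVIEW is graphical, but the dataflow idea translates to text languages: once a value is available, every branch that depends only on it may run concurrently, joining where the wires meet. A minimal sketch in Python (the two branch functions are illustrative stand-ins, not LabVIEW constructs):

```python
# Sketch of dataflow-style parallelism: when a value "branches", the two
# downstream computations are independent and can run concurrently, then
# join where their wires meet. branch_a/branch_b are placeholder workloads.
from concurrent.futures import ThreadPoolExecutor

def branch_a(x: int) -> int:
    return sum(range(x))          # one downstream node

def branch_b(x: int) -> int:
    return x * x                  # the other downstream node

x = 1_000                          # the wire that branches
with ThreadPoolExecutor() as pool:
    fa = pool.submit(branch_a, x)  # both branches become runnable
    fb = pool.submit(branch_b, x)  # as soon as x is available
    merged = fa.result() + fb.result()   # the join node

print(merged)
```

In LabVIEW the scheduling happens automatically wherever a wire forks; here the programmer must submit the branches explicitly, which is exactly the extra bookkeeping the article contrasts.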
By default, LabVIEW automatically multithreads an application into various tasks that are then load-balanced by the OS across the available processor cores. The OS scheduler typically does a good job of this, explains Tristan Jones, technical marketing team leader for National Instruments UK and Ireland. In some applications, though, it can be desirable to assign a task to its own dedicated core: the remaining tasks then share the other processor resources, and nothing can interfere with the time-critical process. For this, the Timed Loop and Timed Sequence structures in LabVIEW provide a processor input through which programmers can manually assign a processor to handle the structure’s execution.
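The idea behind the Timed Loop's processor input, reserving a core for a time-critical task, can be sketched in Python. Core pinning here uses `os.sched_setaffinity`, which exists only on Linux, hence the guard; the task itself is a placeholder:

```python
# Sketch of the Timed Loop "processor input": keep a time-critical task on
# a dedicated core so other work cannot interfere with it. Pinning is
# Linux-only (os.sched_setaffinity), so it is attempted best-effort.
import os

def run_on_dedicated_core(core: int, task):
    """Pin the current process to `core` where possible, then run the task."""
    pinned = hasattr(os, "sched_setaffinity")
    if pinned:
        os.sched_setaffinity(0, {core})   # restrict scheduling to one core
    return task(), pinned

# Placeholder time-critical workload.
result, pinned = run_on_dedicated_core(0, lambda: sum(i * i for i in range(1_000)))
print(result, pinned)
```

On platforms without affinity control the task still runs, just without the isolation guarantee, mirroring the fact that the Timed Loop's processor input is an optional refinement, not a prerequisite.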
Figure 2: LabVIEW’s Real-Time Execution Trace Toolkit
In addition, a number of function blocks, such as matrix multiplication, are optimised for parallel operation, and LabVIEW provides a number of utilities to help programmers maximise performance. Figure 2 shows the Real-Time Execution Trace Toolkit being used for post-execution analysis of an application running on a real-time multicore system. ‘CPU 0’ has been selected in the ‘highlight CPU mode’ drop-down, so the parts of the application that do not execute on CPU 0 are greyed out, allowing the developer to optimise the execution. As with any of the methods discussed in this article, it is difficult to give specific figures for the performance increase a given technique or programming tool brings, because everything is so application-dependent. However, Jones points to an application at the Max Planck Institute in Munich, where researchers applied data parallelism to a LabVIEW program that performs plasma control of the ASDEX tokamak. The program performs computationally intensive matrix operations in parallel on eight CPU cores to maintain a 1ms control loop. Lead researcher Louis Giannone notes: ‘In the first design stage of our control application programmed with LabView, we obtained a 20x processing speedup on an octal-core processor machine over a single-core processor while reaching our 1ms control loop rate requirement.’ Another scientific language that has
added parallelisation tools is Matlab from The MathWorks. The company’s Parallel Computing Toolbox lets you solve computationally and data-intensive problems using Matlab and Simulink on multicore machines. Parallel-processing constructs such as parallel for-loops and code blocks, distributed arrays, parallel numerical algorithms and message-passing functions let you implement task- and data-parallel algorithms at a high level without programming for specific hardware and network architectures. Converting serial Matlab applications to parallel ones requires few code modifications and no programming in a low-level language. By annotating code with keywords such as ‘parfor’ (parallel for-loops) and ‘spmd’ (single program, multiple data), the task and data parallelism in different sections of an algorithm can be exploited.
SCIENTIFIC COMPUTING WORLD JUNE/JULY 2010
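Both the ASDEX application and Matlab's parfor rely on the same data-parallel pattern: loop iterations that touch disjoint chunks of data and can therefore run concurrently. A hedged Python/NumPy sketch (array sizes and worker count are arbitrary; NumPy releases the GIL inside its matrix kernels, so a thread pool suffices here):

```python
# parfor-style data parallelism: split a matrix product into row blocks,
# compute the blocks concurrently, and stitch the results back together.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def block_dot(block: np.ndarray, b: np.ndarray) -> np.ndarray:
    return block @ b                       # one independent "iteration"

rng = np.random.default_rng(0)
a = rng.random((400, 400))
b = rng.random((400, 400))

blocks = np.array_split(a, 4)              # one row block per worker
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = pool.map(block_dot, blocks, [b] * 4)

parallel = np.vstack(list(parts))
```

As in Matlab, the loop body is untouched; only the loop driver changes, which is why such conversions need so few code modifications.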
Down at the OS level
Some ongoing efforts at improving multicore performance may not bear fruit in the immediate future, but hold considerable long-term promise. Barrelfish, for instance, is a new OS being built from scratch in a collaboration between ETH Zurich and Microsoft Research, Cambridge. The researchers are exploring how to structure an OS for future multi- and many-core systems. The motivation behind the project is twofold: first, the rapidly growing number of cores leads to scalability challenges; second, increasing diversity in computer hardware requires the OS to manage and exploit heterogeneous resources.
Senior Microsoft researcher Tim Harris explains that Barrelfish places a separate OS kernel on each core, with the kernels communicating through explicit messages over the HyperTransport interconnect. ‘This is very much a research OS. We’re using it as a laboratory for prototyping new ideas and concepts. We learn what works, then we talk with the product groups, who examine how and when to commercialise the technology.’ He adds that end users will have fewer opportunities to tune algorithms to a particular machine; instead, it must fall to the OS to identify which cores to use for parallel phases and which jobs to run on which cores. ‘If we’re successful, we won’t need a special HPC OS; we could handle all workloads.’
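The multikernel model treats the machine as a distributed system: per-core kernels that share no memory and coordinate only by explicit messages. A rough caricature in Python, with threads standing in for cores and queues for the message channels (the message format and names are illustrative, not Barrelfish's actual protocol):

```python
# Caricature of the multikernel model: each "core" runs its own loop,
# holds only private state, and talks to peers by explicit messages.
import threading
import queue

def core_kernel(core_id, inbox, outbox):
    state = 0                              # private state, never shared
    while True:
        msg = inbox.get()
        if msg is None:                    # shutdown message
            outbox.put(("done", core_id, state))
            return
        state += msg                       # update private state only
        outbox.put(("ack", core_id, state))

inboxes = [queue.Queue() for _ in range(2)]
results = queue.Queue()
cores = [threading.Thread(target=core_kernel, args=(i, inboxes[i], results))
         for i in range(2)]
for t in cores:
    t.start()

inboxes[0].put(10)                         # message "kernel" on core 0
inboxes[1].put(32)                         # message "kernel" on core 1
for q in inboxes:
    q.put(None)                            # broadcast shutdown
for t in cores:
    t.join()

replies = sorted(results.get() for _ in range(4))
print(replies)
```

Because no state is shared, there is nothing to lock; scalability comes from keeping each kernel's working set local to its core, which is the property the Barrelfish researchers are investigating.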