supercomputing software challenges


Lars Koesterke, a performance evaluation and optimisation expert at the Texas Advanced Computing Center (TACC), USA


If you improve the speed of a code by a factor of two, scientists wait half the time to perform an execution or solve a problem, but if you improve the speed of the code by a factor of 100,000, they are able to tackle questions that would have taken exponentially longer. Improving codes to this degree enables scientists to take their research in completely new directions they would never have thought possible before.

Writing good code is actually a lot harder than most people realise. One issue – and this is something that happens a lot – is that scientists generally know a little bit about coding and how to program, so they will write the code themselves and then run it. Usually someone with far more knowledge will then take a look at that code and improve it. This isn't always the best approach. People working in biology, for example, use languages that are much easier to learn and understand, and while the code will get the job done, the speed is an order of magnitude slower than if it were written in a language appropriate for high-performance computing.

Writing code on a workstation is already complicated, but if you then use a supercomputer there are suddenly many more levels of hardware and software that you have to understand before you can begin. For example, you start writing a code on your workstation using language 'x', and if you then use a supercomputer there are other languages on top of 'x' that you have to learn in order to have communication between the individual parts of the system. These are very complicated languages – conceptually, but also with regard to coding and debugging – and that's where users struggle the most.

WE HAVE TO TRAIN AND EDUCATE USERS, AND ENCOURAGE THEM TO GO THROUGH THE STEPS OF LEARNING THESE LANGUAGES AND TOOLS

The solution is that we have to train and educate users, and encourage them to go through the steps of learning these languages and tools. We do this by showing them that, if they do things properly, the code will be much faster and will aid them to such a degree that it will change the nature of the work they are able to do. This is not only true for beginners; it's also true for those who have software written for a specific generation of supercomputers. As soon as the next level of computing arrives it brings new challenges, and so users and programmers have to keep up with new technology and learn the latest language features and concepts in order to exploit these supercomputers.

We have only just broken through the petascale barrier, and so many things have to change before we reach the next generation of supercomputers. One very promising avenue is accelerators. To get to exascale there are a million challenges we have to face first, but we can already see that accelerators will boost performance. Adding them to the individual components of a supercomputer, however, brings another level of complexity, and it will be a major challenge to convert all our codes.


Computer scientists Jeff Hammond, James Dinan and Pavan Balaji, from the Argonne National Laboratory, USA


What challenges did porting Global Arrays to new supercomputing architectures present?

Global Arrays (GA) is a sophisticated programming model, and it has unique requirements in terms of the functionality that must be provided by lower levels of the software stack. For this reason, GA has its own runtime system, called ARMCI (Aggregate Remote Memory Copy Interface). ARMCI provides GA with a global address space view of distributed shared data, as well as the one-sided communication capabilities needed to support GA's high-level, distributed shared array programming interface. When a programmer accesses a section of a global array in an application, several noncontiguous data transfers can often result. Efficiently supporting this type of communication is one of ARMCI's core responsibilities.

ARMCI itself is usually implemented for each target platform and utilises platform-specific characteristics, such as RDMA (remote direct memory access), for performance. GA developers and vendors have provided excellent native implementations of ARMCI for InfiniBand, Cray XT systems, and several others. This native implementation strategy can yield the best performance; however, ARMCI is a complex runtime system that requires significant expertise to maintain and extend to new systems. Our goal in this project was to utilise our experience with MPI to create a portable implementation of ARMCI (and, by extension, GA and the NWChem computational chemistry suite) that utilises MPI's one-sided communication functionality. This portable ARMCI-MPI implementation has eluded researchers for the past decade because of the complexity involved in matching ARMCI's global address space model with MPI's communication window model.

www.scientific-computing.com


What approach did you take?

MPI's remote memory access (RMA) interface provides the ability to perform the asynchronous one-sided communication needed by GA. In designing this interface, the MPI Forum sought to create a universally portable system that would work on systems where processor-processor and processor-network data coherence is weak or nonexistent. Thus, MPI RMA introduces a shared data window construct that is used


June/July 2011

