Analysis and opinion

Challenges and solutions in debugging code on the Intel Xeon Phi coprocessor


Chris Gottbrath, principal product manager at Rogue Wave Software, presents a case study on debugging with TotalView on Beacon


With the launch of the Intel Xeon Phi coprocessor, developers have been presented with many exciting opportunities to take advantage of many-core processor technology. Because the Intel Xeon Phi coprocessor shares many architectural features and much of the development tool chain with multi-core Intel Xeon processors, it is fairly simple to port a program to the new coprocessor. However, taking full advantage of the new power offered by the Intel Xeon Phi coprocessor requires expressing a level of parallelism that demands a re-thinking of algorithms.

This is exactly the challenge that the National Institute for Computational Sciences (NICS) at the University of Tennessee, USA, is working to overcome with its Beacon Project. The project is a research initiative funded by the US National Science Foundation and the University of Tennessee to explore the impact of emerging computer architectures on computational science and engineering. Currently, nine teams associated with the Beacon Project are exploring the impact of the Intel Xeon Phi coprocessor on scientific codes and libraries, with approximately two dozen more open-call applicants about to begin work. Some of the programs being optimised as part of the project include magnetohydrodynamics, plasma physics, cosmology, chemistry, quantum chromodynamics, and bioinformatics applications.


The Beacon system, which received the number-one ranking on the November 2012 Green500 list, offers access to 48 compute nodes and six I/O nodes joined by an FDR InfiniBand interconnect providing 56 Gb/s of bi-directional bandwidth. Each compute node is equipped with two Intel Xeon E5-2670 processors, four Intel Xeon Phi 5110P coprocessors, 256 GB of RAM, and 960 GB of SSD storage. Beacon provides 768 conventional cores and 11,520 accelerator cores, meaning the system offers 210 Tflops of combined computational performance, 12 TB of system memory, 1.5 TB of coprocessor memory, and more than 73 TB of SSD storage in aggregate.






The typical strategy for developers participating in the Beacon Project is to first port and then optimise code for the Intel Xeon Phi coprocessor. An example of this is the Boltzmann-BGK solver, which uses a kinetic model for computational fluid dynamics. With hundreds of thousands of state variables to be solved at each grid point, the BGK-model Boltzmann equation can directly benefit from vectorisation and acceleration on the Intel Xeon Phi coprocessor. As part of its optimisation process for this solver, the team used the early-access version of TotalView to debug its native Intel Xeon Phi code and drill down to the thread level in order to debug issues that came up during porting.
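To make the vectorisation point concrete, below is a minimal sketch in C of the kind of independent per-state update loop that maps well onto the coprocessor's wide vector units. It is illustrative only; the function and variable names (bgk_relax, f, f_eq, tau, n_states) are invented for this sketch and are not taken from the project's actual solver.

    #include <stddef.h>

    /* Hypothetical BGK-style relaxation toward equilibrium: one
     * independent update per state variable, so the compiler is free
     * to vectorise the loop for the coprocessor's 512-bit vector units. */
    void bgk_relax(double *restrict f, const double *restrict f_eq,
                   double tau, size_t n_states)
    {
        #pragma omp simd
        for (size_t i = 0; i < n_states; i++) {
            f[i] += (f_eq[i] - f[i]) / tau;
        }
    }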




The team tracked down a subtle problem after discovering that the answers were wrong in the OpenMP version. Using TotalView, the team analysed the operations occurring on each OpenMP thread. Being able to compare the data from each thread with the ultimate result clarified what was happening in the code, and allowed the team to work with the vendor to get the problem resolved. After porting the code, the team was able to quickly identify and correct initial performance problems, enabling a positive speed-up on the Intel Xeon Phi coprocessor relative to the Intel Xeon processor.
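The article does not identify the specific defect, but a missing reduction clause is a classic way an OpenMP port silently produces wrong answers, and it illustrates why comparing per-thread data against the final result is so effective. The following sketch is hypothetical, not the team's code:

    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* With plain '#pragma omp parallel for', every thread races on
         * the shared accumulator and the answer comes out wrong. The
         * reduction clause gives each thread a private partial sum,
         * combined safely at the end; these partial sums are exactly
         * the per-thread values a thread-aware debugger lets you
         * inspect and compare against the final result. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += 1.0 / (double)n;
        }

        printf("sum = %f (expected 1.0)\n", sum);
        return 0;
    }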


Another example of a successful port to the Intel Xeon Phi coprocessor is the Gyro tokamak plasma simulation code from General Atomics. Gyro numerically simulates tokamak plasma microturbulence and computes the turbulent radial transport of particles and energy in tokamak plasmas, solving 5-D coupled time-dependent nonlinear gyrokinetic Maxwell equations with gyrokinetic ions and electrons. The team porting Gyro faced a problem that originated from a different source than was initially thought. The code was first ported by adding the '-mmic' compiler flag, and was structured around MPI to express multi-node parallelism and OpenMP to express parallelism across a number of threads, in order to take advantage of multi-core compute nodes.
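The hybrid structure described above has a familiar minimal shape, sketched below in C. This is illustrative rather than Gyro's actual code; only the '-mmic' flag is taken from the article, and the exact compile command will vary with the MPI and compiler toolchain, for example something along the lines of 'mpiicc -mmic -openmp hybrid.c -o hybrid.mic' for a native coprocessor build.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* MPI expresses parallelism across nodes and coprocessors ... */
        #pragma omp parallel
        {
            /* ... while OpenMP expresses parallelism across the many
             * threads available within each MPI process. */
            printf("rank %d, thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }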


Using TotalView, the team tracked down the issue that was causing some runs to complete and others to fail in a strange way. In the many-core environment of the Intel Xeon Phi coprocessor, the number of threads created per MPI process had increased from single digits to 50 or 100. The work distribution scheme rested on an assumption that was no longer valid, and therefore work was not being distributed to most of the threads. Had this not been fixed, performance would have been limited, as many cores would have been underutilised. Moreover, in this case, the mistake had a cascading effect that ultimately caused the MPI processes to run out of memory. Fixing the issue also made the program more balanced, which resulted in better performance.
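The article does not show the scheme itself, but the sketch below gives one hypothetical way such an assumption can fail: with a handful of threads every thread receives work, while with 50 to 100 threads the integer division collapses and most threads sit idle. The names and logic are invented for illustration.

    #include <omp.h>

    void distribute_work(int n_items)
    {
        #pragma omp parallel
        {
            int tid      = omp_get_thread_num();
            int nthreads = omp_get_num_threads();

            /* Hidden assumption: n_items is much larger than the thread
             * count. Once nthreads exceeds n_items, chunk becomes 0, so
             * most threads get an empty [begin, end) range and the last
             * thread ends up doing all of the work alone. */
            int chunk = n_items / nthreads;
            int begin = tid * chunk;
            int end   = (tid == nthreads - 1) ? n_items : begin + chunk;

            for (int i = begin; i < end; i++) {
                /* ... process item i ... */
            }
        }
    }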


The Beacon Project has experienced initial success with porting and optimising code for the Intel Xeon Phi coprocessor. The optimisation process has exposed the need for advanced tools that help scientists debug and optimise parallel applications so that they can support hundreds of threads per node. Applications need large numbers of threads, or they will be unable to use more than a fraction of the Intel Xeon Phi coprocessor's power. The biggest challenge is that there are still vast numbers of MPI-based applications that will need to be ported to MPI/OpenMP hybrid parallelism. When this is undertaken, the structure of the code is changed in fundamental ways, and these changes often break the code. TotalView has proved critical in alleviating these growing pains by making it easier and quicker to analyse and resolve defects uncovered or created during the porting process.



