Supercomputing software challenges
to encapsulate globally accessible data and enforce the synchronisation needed to ensure that data is accessed correctly across virtually all hardware platforms. This functionality is especially important as we look forward to new architectures, such as Intel's SCC processor, where traditionally strong coherence is being relaxed to improve performance. To match MPI's conservative semantics with ARMCI's much more flexible semantics, we created an intermediate layer in the ARMCI-MPI runtime system. Called GMR (Global Memory Regions), this layer hides the details of accessing MPI windows and manages accesses to ensure that MPI's semantics are preserved. It also translates between ARMCI global addresses and MPI windows and displacements. Ultimately, GMR provides a middle-ground interface that reconciles the MPI runtime's semantics with those of the ARMCI runtime.
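To make that translation step concrete, here is a minimal sketch in C of the kind of bookkeeping such a layer might perform. The gmr_entry_t record and the gmr_translate and gmr_put functions are hypothetical names invented for illustration, not the actual ARMCI-MPI data structures; they simply show how an ARMCI-style global address could be mapped to an MPI window and displacement and then accessed inside a passive-target epoch so that MPI's semantics are respected.

#include <mpi.h>
#include <stddef.h>

typedef struct {
    void   **bases;   /* base address of the allocation on every rank (hypothetical) */
    size_t   size;    /* extent of each rank's slice, in bytes                        */
    MPI_Win  win;     /* MPI window backing this global memory region                 */
} gmr_entry_t;

/* Map an ARMCI-style remote address (valid in the target's address space)
 * to the window and displacement that MPI needs. Returns 0 on success. */
static int gmr_translate(const gmr_entry_t *regions, int nregions,
                         int target, const void *addr,
                         MPI_Win *win, MPI_Aint *disp)
{
    for (int i = 0; i < nregions; i++) {
        const char *base = (const char *) regions[i].bases[target];
        const char *p    = (const char *) addr;
        if (p >= base && p < base + regions[i].size) {
            *win  = regions[i].win;
            *disp = (MPI_Aint) (p - base);
            return 0;
        }
    }
    return -1;   /* address is not managed by any known region */
}

/* A contiguous put routed through the translated window. The lock/unlock
 * epoch is what preserves MPI's conservative RMA semantics. */
static int gmr_put(const gmr_entry_t *regions, int nregions,
                   const void *src, void *dst, int bytes, int target)
{
    MPI_Win  win;
    MPI_Aint disp;

    if (gmr_translate(regions, nregions, target, dst, &win, &disp) != 0)
        return -1;

    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(src, bytes, MPI_BYTE, target, disp, bytes, MPI_BYTE, win);
    MPI_Win_unlock(target, win);   /* completes the transfer at the target */
    return 0;
}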
Why did you tackle the problem in this way? In several ways, the GA interface is more natural to implement on top of MPI-RMA than ARMCI is; the ARMCI runtime system provides more relaxed memory consistency and data-access semantics than what is exposed at the GA level. However, GA is a sophisticated software system with over a decade of effort invested in its development, so the opportunity for the greatest impact is in providing a portable, high-performance ARMCI runtime system. In many situations, ARMCI's more relaxed semantics have allowed developers to internally utilise node-level shared memory and direct access for better performance. This type of optimisation is not easy to accommodate in the MPI-2 RMA model, and we are working with the GA and ARMCI developers to strengthen the GA-ARMCI software contract to ensure compatibility and enhance performance with MPI. Several of us are also involved in defining the RMA interface for the next version of the MPI standard, and this work has helped us identify ways to improve the performance and usability of future versions of MPI.
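The semantic gap mentioned above, between ARMCI's relaxed model and MPI-2 RMA's more conservative rules, can be illustrated with a small, purely hypothetical example. Under ARMCI, a process may update its own slice of a global array with ordinary stores (and implementations exploit node-level shared memory in the same spirit); the portable MPI-2 pattern is instead to route even such updates through an RMA operation inside an access epoch. The function below is a sketch of that contrast, not code from GA or ARMCI-MPI.

#include <mpi.h>

/* Update element 'i' of the calling process's own slice of a globally
 * accessible array. 'local' is the local base address and 'win' the MPI
 * window exposing it; both names are illustrative. The window is assumed
 * to have been created with disp_unit == sizeof(double). */
static void update_own_element(double *local, MPI_Win win,
                               int my_rank, int i, double value)
{
    /* ARMCI-style relaxed semantics would allow a direct store:
     *
     *     local[i] = value;
     *
     * Under the MPI-2 RMA model, the conservative, portable pattern is to
     * issue the update as an RMA operation inside an access epoch, even
     * when the target is the calling process itself: */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, my_rank, 0, win);
    MPI_Put(&value, 1, MPI_DOUBLE, my_rank, (MPI_Aint) i,
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(my_rank, win);

    (void) local;   /* unused in the MPI-2 path shown here */
}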
How successful was that approach? The metrics for success on this project were the portability and performance of ARMCI and the performance and scalability of NWChem. The architectures we used were Blue Gene/P, Cray XT5, Cray XE6, and an Infiniband cluster with Intel Nehalem CPUs. A few bugs needed to be fixed in two of the MPI implementations, but otherwise we were able to run ARMCI immediately. The native implementation of ARMCI outperformed ARMCI-MPI on the three mature platforms (Blue Gene/P, Cray XT5, and Infiniband), as expected given the significant investment the GA developers have made in performance on those architectures. On the much newer Cray XE6, however, ARMCI-MPI outperformed the native implementation, in part because of Cray's efforts to optimise MPI RMA.
What difficulties are there in adapting software to new architectures? Future extreme-scale platforms will have complex architectures with millions of cores in total, hundreds of cores per node, and a limited amount of memory per core. Adapting software to new architectures is already a time-consuming and cumbersome task; when we add the disruptive changes that upcoming architectures are expected to bring, it becomes even more challenging. For standardised programming models such as MPI, vendors already invest a lot of effort in porting and optimising the implementation for every new architecture, primarily because no one would purchase a machine on which MPI did not run well. However, expecting researchers or vendors to provide a native port of every programming model and runtime system on every new architecture is unreasonable. Consequently, when a new architecture is released, an MPI implementation for it is released simultaneously, whereas it often takes several years for other runtime systems to be ported. Allowing models such as GA to work over MPI instead of their native implementation will let scientists run their applications on new machines as soon as those machines are released. This effectively makes the native port of GA on a given architecture a performance optimisation rather than a show-stopper for new science. This is a subtle yet critical benefit that ARMCI-MPI provides for science applications such as NWChem, which previously were unable to take advantage of the latest architectures as soon as they reached the market.
How does the pace of software advances compare with that of hardware? Traditionally, software has lagged behind hardware in terms of development schedule. The traditional view was that vendors develop the hardware, port basic runtime systems such as MPI to the platform, and then ship the machine. Other pieces of the system software and programming infrastructure were developed through hardware early-access programs once the hardware had been deployed. Eventually, once everything else was in place, scientists would start porting and tuning their applications for the new platform. While this model has generally worked well, the amount of disruption we expect in hardware platforms over the next decade makes such an approach impractical. Without a close tie-in between hardware developers, system software developers and application developers, changes to hardware architectures may be too disruptive for applications to take advantage of on practical problems. By layering higher-level models such as GA, and consequently the various applications that use GA, on top of portable programming models such as MPI, we can shave years off the cycle from hardware to system software to applications, a cycle that would otherwise be impractical to sustain.
Are there any other challenges you expect to face in the future? Just like hardware architectures, scientific applications are also growing in complexity. The new algorithms and methods that different science domains are planning for the next few years will make scientific simulations more accurate and effective than ever before. However, these benefits come at a cost; no single parallel programming model available today will be able to support all the features required by these applications while taking full advantage of the hardware infrastructure. Applications will need to combine multiple programming models to take advantage of the features and strengths of different models. The challenges we, as a community, face in
achieving broad interoperability are abundant and will require significant effort across many research teams. Given the disruptive technologies we anticipate in the hardware space, the need to rethink our HPC software stack, and the high potential for impact on scientific computing, we are excited about what the next several years of high-end computing will bring.
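As one small, concrete example of the interoperability issues alluded to above (not drawn from the interview itself): a code that combines MPI with a threaded programming model must first establish that the MPI library provides an adequate thread-support level. The sketch below shows that basic handshake; the fallback behaviour is, of course, application-specific.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided = MPI_THREAD_SINGLE;

    /* Ask for full multi-threaded support so another model's threads may
     * also call MPI; the library reports what it can actually provide. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr,
                "MPI_THREAD_MULTIPLE not available (got level %d); "
                "a hybrid runtime would need a fallback strategy here.\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    /* ... hand MPI_COMM_WORLD (or a duplicated communicator) over to the
     * other runtime system here, so the two models do not interfere ... */

    MPI_Finalize();
    return 0;
}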