HPC_YEARBOOK

HPC 2013-14 | Tuning software for HPC

accelerators will be hidden from the developer. Libraries of primitives are typically implemented using C++ templates and are developed primarily by the hardware vendors; examples include Intel Threading Building Blocks, Nvidia Thrust, and AMD Bolt. There are also solutions from independent companies, such as ArrayFire from AccelerEyes.
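To give a feel for the library-of-primitives style, here is a minimal sketch that uses only the C++ standard library as a stand-in. It is not taken from any of the vendor libraries named above, although they expose template algorithms of a broadly similar shape. The function name `saxpy_primitive` is a hypothetical one chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: y = a*x + y written as one call to a generic
// "transform" primitive rather than a hand-written loop.
void saxpy_primitive(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    // The primitive walks x and y together and writes the result back into y.
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

The point of the style is that the caller states what should happen to each element and leaves the question of how the work is scheduled to the library.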

Special languages

While the two approaches mentioned above allow, at least in theory, the whole algorithm to be coded without accelerator-specific programming languages, the ‘special language’ concept relies heavily on extensions of popular languages created for massively parallel architectures. These extensions usually simplify development or provide additional tools to improve computing speed. Language extensions make it possible to code any algorithm, but the other side of the coin is that a version of a computational module developed specifically for a heterogeneous system needs substantial optimisation, and will have to be tailored in future to the new accelerators that appear every one or two years.

The process of standardising such language extensions started a couple of years ago. For instance, the technologies developed by PGI and Caps Entreprise, and supported by Nvidia, have been embedded into the OpenACC standard, and Intel has abandoned its ArBB project in favour of OpenMP and the OpenCL standard, which has competed with Nvidia’s Cuda technology with mixed success.
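As a small illustration of what a standardised extension looks like in source code, the sketch below annotates an ordinary loop with an OpenMP directive. It shows only the basic multi-threading form; offloading to an accelerator needs further directives that are not shown here. The function name `saxpy_directive` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: y = a*x + y written as a plain loop that is
// annotated with an OpenMP directive.
void saxpy_directive(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());  // both vectors must have the same length
    const long n = static_cast<long>(x.size());
    // With OpenMP enabled (for example -fopenmp on GCC) the iterations are
    // shared among threads; otherwise the pragma is ignored and the loop
    // simply runs serially.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Unlike a library call, the loop body stays in the developer's hands, which is what makes it possible to express any algorithm, and also what leaves the tuning burden with the developer.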


Autotuning techniques

The idea of automatically adapting software to a specific accelerator, known as autotuning, is becoming increasingly popular. It assumes that an application will modify itself to suit the hardware it is running on. This allows the customer to save a lot at the development stage, as the programmer does not have to deal with the specifics of particular hardware. In many cases, it also provides an additional performance gain. Real implementations of the autotuning concept range from an additional module embedded by the compiler, to a separate pre-processor, to a third-party library. From the developer’s viewpoint, using such technologies is similar to using an additional interface that partially or completely ‘hides’ the hardware-specific process of software tuning.

Autotuning-based approaches are still under development and are currently represented mostly in scientific papers. The very first examples of commercial autotuning products are HMPP from Caps Entreprise, TTG Apptimizer from ttgLabs, and the GCC compiler plug-in developed within the university project StarPU.

Jos Martin, principal software engineer at MathWorks

Like many applications, ours are relatively isolated from the hardware by layers such as compilers and libraries, and much of the performance relies on the ability of those compilers to understand how to interact with the underlying hardware. It’s simply too costly to take any other approach as we, and applications vendors like us, don’t have the development resources to write the underlying libraries ourselves. There is very little risk involved in this approach, but the mitigation is that we need to be aware of the direction compilers are heading in, as well as the possible hardware developments that are on the horizon, so that we can prompt the people developing our tool chains.

If users are on the bleeding edge of technology, they may struggle to get the best performance out of any new hardware until all the relevant factors have caught up. Sometimes trade-offs need to be made whereby, if we tune the application in a certain way, it will be faster on one architecture but slower on another. Making that decision is very much a judgement call, and we endeavour to keep the majority of users happy, with very little performance degradation experienced by the rest. The good data we have regarding the hardware currently being used to run our software, as well as the historical data relating to the uptake of new hardware across our product line, enables us to make an educated guess as to where our customers will invest.

Optimisation, however, is stagnant compared to areas such as cloud computing. The industry would rather focus on those other developments because, with optimisation projects, only one cluster is being targeted. Cloud computing, on the other hand, has common hardware, making it a far better investment. All applications vendors want their codes to run in as many places as possible, so if we optimise those codes to run on one cluster but not another it doesn’t benefit us or our users. To solve this there needs to be commonality between clusters but, realistically, as long as hardware vendors continue to focus on aspects other than optimisation, it probably won’t happen. This leaves us in the position of trying to ensure our applications run faster in the majority of cases. If there were a way around this, it would make software vendors’ lives far easier, and provide users with applications they know will perform on any hardware they choose.

As for the future evolution of the various approaches to optimisation, it is almost impossible to predict which particular paradigm will dominate the exaflops era. Obviously, all of them will be in demand in one form or another as ideal solutions for specific tasks. However, the development of hardware can have a significant impact on software tools. For example, if the transition to a unified memory architecture is accompanied by standardisation of all types of accelerators, the most promising approach will be language extensions that enable people to easily create an ‘already tuned’ application. If the number of different types of accelerators remains high, considerably improved autotuning technologies will prevail.

Finally, we should not completely ignore the rather unlikely scenario in which exaflop supercomputers are inherently homogeneous. In such a case, all of the above problems of software optimisation will disappear by themselves.

In conclusion, it is worth mentioning that these briefly described optimisation paradigms should not be considered as alternatives to each other and can (and sometimes should) be used together. However, the choice of particular tools for speeding up computational software on GPUs should be preceded by selecting a particular paradigm to make the most of its advantages.
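As a rough illustration of the autotuning paradigm discussed above, the following sketch times a few variants of one routine on the machine it is running on and keeps the fastest. It is a deliberately simplified toy and does not claim to reflect how HMPP, TTG Apptimizer or the StarPU plug-in work internally. The names `blocked_sum` and `autotune_block`, and the choice of block size as the tuned parameter, are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical tunable routine: sums a vector in blocks of a given size.
double blocked_sum(const std::vector<double>& v, std::size_t block) {
    assert(block > 0);  // a zero block size would never advance
    double total = 0.0;
    for (std::size_t start = 0; start < v.size(); start += block) {
        const std::size_t end = std::min(start + block, v.size());
        double partial = 0.0;
        for (std::size_t i = start; i < end; ++i) partial += v[i];
        total += partial;
    }
    return total;
}

// Hypothetical tuner: runs each candidate block size once on sample data
// and returns the one that finished fastest on the current hardware.
std::size_t autotune_block(const std::vector<double>& sample,
                           const std::vector<std::size_t>& candidates) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    volatile double sink = 0.0;  // keeps the timed call from being optimised away
    for (std::size_t block : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        sink = blocked_sum(sample, block);
        const auto elapsed = std::chrono::steady_clock::now() - t0;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = block;
        }
    }
    return best;
}
```

An application would call `autotune_block` once at start-up and reuse the chosen value afterwards. A single timing run per candidate is noisy, so a practical tuner would repeat the measurements and search a much larger space of variants.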
