HIGH PERFORMANCE COMPUTING

Panel 2

for (int iz = HALF_LENGTH; iz < nz - HALF_LENGTH; iz++) {
    for (int iy = HALF_LENGTH; iy < ny - HALF_LENGTH; iy++) {
        for (int ix = HALF_LENGTH; ix < nx - HALF_LENGTH; ix++) {
            int offset = iz*dimnXnY + iy*nx + ix;
            float value = 0.0;
            value += ptr_prev[offset]*coeff[0];
Panel 3

float *prev_base = (float*)_mm_malloc(nsize*sizeof(float) + 32*sizeof(float), 64);
float *next_base = (float*)_mm_malloc(nsize*sizeof(float) + 32*sizeof(float), 64);
float *vel_base  = (float*)_mm_malloc(nsize*sizeof(float) + 32*sizeof(float), 64);
Panel 4

p.prev = &prev_base[16-HALF_LENGTH];
p.next = &next_base[16-HALF_LENGTH];
p.vel = &vel_base[16-HALF_LENGTH];
proposes a dependency analysis. Using instrumentation, the tool checks all memory accesses in the loop at runtime and reports whether any dependencies were found; a detected dependency would mean that vectorisation is unsafe. After running the dependency analysis, Advisor reports that no dependencies were found, and suggests ways to force vectorisation. Following this advice, we add the #pragma omp simd OpenMP directive at line 42 (as indicated in the Advisor survey analysis).
Running on a two-socket Intel Xeon CPU E5-2699 v3 @ 2.30GHz, this addition improves performance from 27.7 Gflops to 47.6 Gflops (1.7-times speedup).
Alignment with Advisor

We now run the survey again, adding another analysis called 'trip counts', to see if Advisor can find other vectorisation-related problems (Figures 2 and 3). The new survey shows that our previous hotspots are now vectorised, but Advisor reports three different implementations for these two loops: vectorised body, peeled, and remainder. Most of the time is spent in the vectorised body (a vectorised implementation of the loop, in which data are packed into SIMD registers and processed together), but we can also see four iterations in the peel and remainder. It is usually good practice to avoid running peel or remainder loops when possible: they are scalar implementations, which means that we are not taking full advantage of our vectorisation capabilities. In addition,
these loops are part of a bigger loop that is called many times (exactly 3,407,872 times), which means that for every call to these loops we load three implementations from the instruction cache. The code here is fairly small, but in larger loops this might also hurt the efficiency of the instruction cache, further reducing performance. To solve this issue, we need to align our arrays. Because we are not implementing the special case for the border of the 3D volume (the so-called boundary conditions), we start propagating the wave at index HALF_LENGTH instead of index 0. We therefore need to ensure that this index is aligned, to guarantee that we can efficiently load the data into the SIMD registers (see panel 2).
At the allocation site, instead of using malloc we use _mm_malloc, which allows us to specify an alignment. We will pad the arrays later, so to allocate enough space we add 32 elements to each of our arrays and specify an alignment of 64 bytes, which will also work for AVX-512 if we need it later (see panel 3). The next step is to pad our arrays (by 16 − HALF_LENGTH elements, as in panel 4) to guarantee that once we reach the first element we actually load, its address is cache-line aligned. We are using single-precision numbers, so a single cache line can store 16 values. The last step is to pass p.prev, p.next, and p.vel to our function in the main file:

iso_3dfd(p.next, p.prev, p.vel, coeff, p.n1, p.n2, p.n3, p.num_threads, ….)

These changes improve performance from 47.6 Gflops to 57.2 Gflops (1.2-times speedup).
We can run the survey and trip counts analyses again in Advisor to verify that the peel and remainder loops no longer appear (see Figure 4). We can also see that in the previous version we were iterating 30 times in the vectorised loop body, whereas in the new version we run 31 iterations.
Conclusions

In this article, we have improved the vectorisation of Iso3DFD with the help of Intel Advisor. We started by checking the status of our main loops and observed that the compiler had only generated scalar implementations. Next, we verified that vectorisation was safe using the dependency analysis, and forced the compiler to vectorise the main loops. Finally, Advisor detected peel and remainder loops, indicating that the vectorisation could still be improved. Using _mm_malloc and adding some padding to improve alignment, we ended up with a fully vectorised implementation. The optimisations presented in this article moved performance from 27.7 Gflops to 57.2 Gflops (a 2.1-times speedup).
Intel Advisor can be downloaded for free:
https://software.intel.com/en-us/advisor/choose-download
Figure 4

June/July 2019, Scientific Computing World