This tutorial demonstrates the implementation and analysis of the parallel reduction algorithm using the CLamsee slang. The sessions for this tutorial is inlcuded in the ViSlang binary package but you can also download them here. CLamsee is a a general slang for parallel computing that consists of a subset of OpenCL C, an introduction can be found here.
Parallel reduction is a well understood algorithm and optimizations for GPUs. Although reduction is a fairly simple concept, parallel implementations thereof are much harder to comprehend. The algorithm is frequently used in education as an example to explain GPU optimization strategies. The reduction algorithm pairwise reduces a set of elements to one final element using a reduction operator. Typical reduction operators are min, max, sum, etc. and are associative and commutative, which makes different implementations possible.
The implementation of the parallel reduction algorithm uses a tree-based approach, as shown above. Each workgroup reduces a certain set of elements. To continue the redection a global synchronization between the different work groups is required. OpenCL does not support global synchronization since it would be expensive to build in hardware for GPUs. Therefore, the parallel uses kernel decomposition, which splits the result in several invocations of the kernel. This decomposition is achieves the required global synchronization during the computation. The complete execution of the parallel reduction uses several kernel invocations resulting in several pyramidal reduction levels.
The NVIDIA OpenCL SDK provides different implementations of the parallel reduction algorithm, which progressively make changes to the code for performance improvements. We have analyzed the first three of these implementations with our system. In the following, we investigate their branching behavior and the local memory accesses.
The first step in the implementation of the device code loads global memory into local memory, to achieve efficient access of subsequent instructions. Then the parallel reduction is performend. This part is changed within the different implementations. Subsequently, the result is written back to global memory. This kernel has to be called several times to achieve the final result of the reduction.
As shown in the analysis, the first implementation has highly divergent warps, which is very inefficient. This variation of the implementation removes the divergence in the inner loop by using a non-divergent branch and a strided index.
The second version reduces the divergence in the warps but introduces local memory bank conflicts. This implementation changes adapts the code so that the bank conflicts are removed.
The CLamsee slang offers different visualizations in order to investigate the execution behavior of the implementation. In this case we use the watch annotations to investigate the local memory behavior of "sdata".
And we the annotation to investigate the branching behavior.
However, the analysis of local memory accesses in reveals the second implementation now suffers from bank conflicts. Hovering over a memory bank (left), displays the occurrences of bank conflicts. This implementation uses an interleaved indexing of the local memory. The next implementation changes the indexing to a sequential one, which is visible in the visualizations, where the memory bank conflicts are now removed.
Copyright © 2015-2020 vislang.net