CLamsee - Parallel Reduction

This tutorial demonstrates the implementation and analysis of the parallel reduction algorithm using the CLamsee slang. The sessions for this tutorial is inlcuded in the ViSlang binary package but you can also download them here. CLamsee is a a general slang for parallel computing that consists of a subset of OpenCL C, an introduction can be found here.


Parallel reduction is a well understood algorithm and optimizations for GPUs. Although reduction is a fairly simple concept, parallel implementations thereof are much harder to comprehend. The algorithm is frequently used in education as an example to explain GPU optimization strategies. The reduction algorithm pairwise reduces a set of elements to one final element using a reduction operator. Typical reduction operators are min, max, sum, etc. and are associative and commutative, which makes different implementations possible.

Reduction Snippet

The implementation of the parallel reduction algorithm uses a tree-based approach, as shown above. Each workgroup reduces a certain set of elements. To continue the redection a global synchronization between the different work groups is required. OpenCL does not support global synchronization since it would be expensive to build in hardware for GPUs. Therefore, the parallel uses kernel decomposition, which splits the result in several invocations of the kernel. This decomposition is achieves the required global synchronization during the computation. The complete execution of the parallel reduction uses several kernel invocations resulting in several pyramidal reduction levels.

The NVIDIA OpenCL SDK provides different implementations of the parallel reduction algorithm, which progressively make changes to the code for performance improvements. We have analyzed the first three of these implementations with our system. In the following, we investigate their branching behavior and the local memory accesses.

Implementation Version 1

The first step in the implementation of the device code loads global memory into local memory, to achieve efficient access of subsequent instructions. Then the parallel reduction is performend. This part is changed within the different implementations. Subsequently, the result is written back to global memory. This kernel has to be called several times to achieve the final result of the reduction.

Device Code

using Device; kernel reduce0(global float *g_idata, global float *g_odata, int n, int local_size) { // allocation of local memory local float sdata[local_size]; // ================== Load Shared Memory ================== int tid = get_local_id(0); int i = get_global_id(0); if (i < n) { // load from global memory into local memory sdata[tid] = g_idata[i]; } else { sdata[tid] = 0; } // synchronize local memory barrier(CLK_LOCAL_MEM_FENCE); // ================== Parallel Reduction ================== for (int s=1; s < get_local_size(0); s *= 2) { if ((tid % (2*s)) == 0) { sdata[tid] += sdata[tid + s]; } barrier(CLK_LOCAL_MEM_FENCE); } // ================ Write back to global mem ================ if (tid == 0) { g_odata[get_group_id(0)] = sdata[0]; } }

Host Code

// set worksize integer local_size = 32; integer global_size = 256; Device.setLocalWorkSize(local_size,1); Device.setGlobalWorkSize(global_size,1); // create data float [global_size] input; float [global_size/local_size] output; Arrays.fill(input, 10); // call kernel Device.reduce0(input, output, global_size, local_size);

Implementation Version 2

As shown in the analysis, the first implementation has highly divergent warps, which is very inefficient. This variation of the implementation removes the divergence in the inner loop by using a non-divergent branch and a strided index.

Device Code

... // ============ Parallel Reduction ============ for (int s=1; s < get_local_size(0); s *= 2) { int index = 2 * s * tid; if (index < get_local_size(0)) { sdata[index] += sdata[index + s]; } barrier(CLK_LOCAL_MEM_FENCE); } ...

Implementation Version 3

The second version reduces the divergence in the warps but introduces local memory bank conflicts. This implementation changes adapts the code so that the bank conflicts are removed.

Device Code

... // ============ Parallel Reduction ============ for (int s = get_local_size(0)/2; s > 0; s /= 2) { if (tid < s) { sdata[tid] += sdata[tid + s]; } barrier(CLK_LOCAL_MEM_FENCE); } ...


The CLamsee slang offers different visualizations in order to investigate the execution behavior of the implementation. In this case we use the watch annotations to investigate the local memory behavior of "sdata".

... // allocation of local memory local float sdata[local_size]; @watch sdata; ...

And we the annotation to investigate the branching behavior.

... // ================ Write back to global mem ================ if (tid == 0) { g_odata[get_group_id(0)] = sdata[0]; } } @computeBranchingBehavior;

Comparison of Version 1 and Version 2

The investigation of branching behavior shows that the first implementation uses a highly divergent approach to utilize different work items. The highlights (bottom left) show that half of the work items of the hovered warp representation follows a different branching pattern than the rest. This origins from the fact that the implementation only uses every second work item to perform a reduction step. Investigating the second implementation (right) shows that the workload is balanced between warps and that warp divergence is reduced.

Reduction Comparison

Comparison of Version 2 and Version 3

However, the analysis of local memory accesses in reveals the second implementation now suffers from bank conflicts. Hovering over a memory bank (left), displays the occurrences of bank conflicts. This implementation uses an interleaved indexing of the local memory. The next implementation changes the indexing to a sequential one, which is visible in the visualizations, where the memory bank conflicts are now removed.

Reduction Comparison

Copyright © 2015-2020