Abstract
We evaluate the performance of baseline and optimized reductions in OpenMP on an NVIDIA Grace-Hopper system. We explore how reduction performance is affected by the number of teams, the number of elements summed per loop iteration, and simultaneous execution on the central processing unit (CPU) and the graphics processing unit (GPU) in unified memory (UM) mode. The experimental results show that the optimized reductions are 6.120X to 20.906X faster than the baselines on the GPU and that they achieve 89% to 95% of the theoretical GPU memory bandwidth. When the reduction is co-run on the CPU and GPU in UM mode, the results depend on where the input array is allocated in the program: the average speedup over GPU-only execution is approximately 2.484 or 1.067, and the speedup of the optimized reductions over the baseline reductions ranges from 0.996 to 10.654 or from 0.998 to 6.729.
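To make the two tuning parameters concrete, the following is a minimal sketch of what a baseline and a tuned OpenMP offload sum reduction might look like. It is an illustration under stated assumptions, not the implementation evaluated here: the function names `sum_baseline` and `sum_tuned`, the parameters `teams` and `v`, the `int`/`long` element types, and the explicit `map` clauses (which a UM-mode build may not need) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: one element per loop iteration, with the
   number of teams left to the OpenMP runtime. */
long sum_baseline(const int *a, size_t n)
{
    long sum = 0;
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n]) map(tofrom: sum) reduction(+: sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Hypothetical tuned variant: the caller chooses the number of teams
   and the number of elements (v) summed per loop iteration. */
long sum_tuned(const int *a, size_t n, int teams, size_t v)
{
    assert(teams > 0 && v > 0);
    size_t chunks = n / v;   /* full groups of v elements */
    long sum = 0;
    #pragma omp target teams distribute parallel for num_teams(teams) \
        map(to: a[0:n]) map(tofrom: sum) reduction(+: sum)
    for (size_t c = 0; c < chunks; c++) {
        long partial = 0;
        for (size_t j = 0; j < v; j++)   /* v elements per iteration */
            partial += a[c * v + j];
        sum += partial;
    }
    /* Leftover elements (n not a multiple of v) are summed on the host. */
    for (size_t i = chunks * v; i < n; i++)
        sum += a[i];
    return sum;
}
```

Compiled without OpenMP support, the pragmas are ignored and both functions run serially on the host with the same result, which makes the sketch easy to check before experimenting with `teams` and `v` on a device.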