
Sum Reduction with OpenMP Offload on NVIDIA Grace-Hopper System...

by Zheming Jin
Publication Type
Conference Paper
Book Title
SC24: The International Conference for High Performance Computing, Networking, Storage, and Analysis
Publication Date
Publisher Location
New Jersey, United States of America
Conference Name
The International Conference for High Performance Computing, Networking, Storage, and Analysis: MEMO’24
Conference Location
Atlanta, Georgia, United States of America
Conference Sponsor
ACM
Conference Date
-

We evaluate the performance of the baseline and optimized reductions in OpenMP on an NVIDIA Grace-Hopper system. We explore the impact of the number of teams, the number of elements summed per loop iteration, and simultaneous execution on the central processing unit (CPU) and the graphics processing unit (GPU) in the unified memory (UM) mode on the reduction performance. The experimental results show that the optimized reductions are 6.120X to 20.906X faster than the baselines on the GPU, sustaining 89% to 95% of the theoretical GPU memory bandwidth. When the reduction co-runs on the CPU and the GPU in the UM mode, the results depend on where the input array is allocated in the program: the average speedup over GPU-only execution is approximately 2.484 or 1.067, and the speedup of the optimized reductions over the baseline reductions ranges from 0.996 to 10.654 or from 0.998 to 6.729, respectively.
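
As a rough illustration of the kind of kernel evaluated here, a baseline sum reduction with OpenMP offload might look like the following C sketch. The problem size, the number of teams, and the number of elements summed per loop iteration are illustrative values chosen for this example, not taken from the paper, and the code is not the authors' implementation.

/* Minimal sketch: baseline OpenMP offload sum reduction with the two
 * tunables varied in the paper (number of teams, elements per iteration). */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  const long n = 1L << 26;        /* problem size (assumed for illustration) */
  const int num_teams = 1024;     /* tunable: number of teams */
  const int elems_per_iter = 4;   /* tunable: elements summed per iteration */

  double *a = (double *)malloc(n * sizeof(double));
  for (long i = 0; i < n; i++) a[i] = 1.0;

  double sum = 0.0;
  /* Baseline: let the runtime distribute the loop across teams/threads
   * and combine the partial sums through the reduction clause. */
  #pragma omp target teams distribute parallel for num_teams(num_teams) \
          reduction(+ : sum) map(to : a[0:n]) map(tofrom : sum)
  for (long i = 0; i < n; i += elems_per_iter) {
    double local = 0.0;
    for (int j = 0; j < elems_per_iter && i + j < n; j++)
      local += a[i + j];
    sum += local;
  }

  printf("sum = %f (expected %ld)\n", sum, (long)n);
  free(a);
  return 0;
}

In this sketch, varying num_teams and elems_per_iter mirrors the tuning parameters studied in the paper; on a Grace-Hopper system with unified memory, the same array could additionally be processed concurrently by CPU threads, which is what the co-running experiments measure.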