Sum Reduction with OpenMP Offload on NVIDIA Grace-Hopper System...

by Zheming Jin

Publication Type

Conference Paper

Book Title

SC24: The International Conference for High Performance Computing, Networking, Storage, and Analysis

Publication Date

November 2024

Publisher Location

New Jersey, United States of America

Conference Name

The International Conference for High Performance Computing, Networking, Storage, and Analysis: MEMO’24

Conference Location

Atlanta, Georgia, United States of America

Conference Sponsor

ACM

Conference Date

Nov 17, 2024 - Nov 22, 2024

Abstract

We evaluate the performance of baseline and optimized sum reductions in OpenMP on an NVIDIA Grace-Hopper system. We explore how the number of teams, the number of elements summed per loop iteration, and simultaneous execution on the central processing unit (CPU) and the GPU in unified memory (UM) mode affect reduction performance. The experimental results show that the optimized reductions are 6.120X to 20.906X faster than the baselines on the GPU, and that they achieve 89% to 95% of the theoretical GPU memory bandwidth. When the reduction is co-run on the CPU and GPU in UM mode, the results depend on where the input array is allocated in the program: the average speedup over GPU-only execution is approximately 2.484 or 1.067, and the speedup of the optimized reductions over the baseline reductions ranges from 0.996 to 10.654 or from 0.998 to 6.729.
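The two tuning knobs named in the abstract, the number of teams and the number of elements summed per loop iteration, can be illustrated with a minimal sketch. This is not the paper's code: the function names `sum_baseline` and `sum_tuned` and the parameters `teams` and `chunk` are hypothetical, chosen only for illustration, and the values that perform well on a given system would have to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline sketch: one element per loop iteration; the OpenMP runtime
 * chooses the number of teams. */
double sum_baseline(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:sum) \
            map(to: a[0:n]) map(tofrom: sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Tuned sketch (hypothetical parameters): the caller requests a team count
 * and each loop iteration sums `chunk` consecutive elements into a private
 * partial sum before contributing to the reduction. */
double sum_tuned(const double *a, size_t n, int teams, size_t chunk)
{
    assert(teams > 0 && chunk > 0);
    size_t num_chunks = (n + chunk - 1) / chunk;  /* round up */
    double sum = 0.0;
    #pragma omp target teams distribute parallel for num_teams(teams) \
            reduction(+:sum) map(to: a[0:n]) map(tofrom: sum)
    for (size_t c = 0; c < num_chunks; c++) {
        size_t begin = c * chunk;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        double partial = 0.0;
        for (size_t i = begin; i < end; i++)
            partial += a[i];
        sum += partial;
    }
    return sum;
}
```

Both functions return the same value for the same input. If the file is compiled without OpenMP support, the pragmas are ignored and the loops run serially on the host, which makes the sketch easy to check for correctness before timing it on a GPU.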
