Abstract
Large-scale deep learning (DL) on HPC systems such as Frontier and Summit relies on distributed node-local caching to address scalability and performance challenges. However, as these systems grow more complex, the risk of node failures increases, and current caching approaches lack fault tolerance, jeopardizing large-scale training jobs. We analyzed six months of SLURM job logs from Frontier and found that over 30% of jobs failed, after running for an average of 75 minutes. To address this, we propose fault-tolerance strategies that recache the data lost from failed nodes, using a hash ring technique to balance the recached data across the surviving nodes of the distributed node-local cache and reduce reliance on the parallel file system (PFS). Our extensive evaluations on Frontier showed that the hash ring-based recaching approach reduced training time by approximately 25% compared with redirecting I/O to the PFS after node failures, and demonstrated effective load balancing of training data across nodes.
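The hash ring idea behind the recaching strategy can be illustrated with a minimal consistent-hashing sketch. This is an assumed, simplified illustration (class name, virtual-node count, and MD5 hashing are all illustrative choices, not the paper's implementation): samples map to the first node clockwise from their hash position, so when a node fails, only the samples it cached need to be redistributed, and they spread evenly over the survivors.

```python
import hashlib
from bisect import bisect_right, insort

class HashRing:
    """Illustrative consistent-hash ring (not the paper's code):
    keys map to the first node clockwise from their hash; removing
    a failed node reassigns only that node's keys to survivors."""

    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(value):
        # Any stable hash works; MD5 is used here for illustration.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node, vnodes=64):
        # Virtual nodes smooth the load distribution across the ring.
        for i in range(vnodes):
            insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # On failure, drop the node's ring positions; its keys fall
        # through to the next clockwise node.
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        # First ring position at or after the key's hash (wrapping).
        hashes = [h for h, _ in self._ring]
        idx = bisect_right(hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

Usage follows the failure scenario described above: after removing a failed node, samples cached on surviving nodes keep their placement, and only the failed node's samples are recached elsewhere.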