Publication

Fault-Tolerant Deep Learning Cache with Hash Ring for Load Balancing in HPC Systems

Publication Type
Conference Paper
Book Title
SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publication Date
Page Numbers
1349 to 1357
Publisher Location
New Jersey, United States of America
Conference Name
SC2024: International Conference for High Performance Computing, Networking, Storage, and Analysis
Conference Location
Atlanta, Georgia, United States of America
Conference Sponsor
ACM
Conference Date
-

Large-scale deep learning (DL) training on HPC systems such as Frontier and Summit uses distributed node-local caching to address scalability and performance challenges. However, as these systems grow more complex, the risk of node failures increases, and current caching approaches lack fault tolerance, which jeopardizes large-scale training jobs. We analyzed six months of SLURM job logs from Frontier and found that over 30% of jobs failed after an average of 75 minutes. To address this, we propose fault-tolerance strategies that recache data lost from failed nodes, using a hash ring technique to balance the recached data across the distributed node-local cache and reduce reliance on the parallel file system (PFS). Our extensive evaluations on Frontier showed that the hash ring-based recaching approach reduced training time by approximately 25% compared to an approach that redirects I/O to the PFS after node failures, and that it balanced training data effectively across nodes.
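
The abstract does not describe how the hash ring is implemented, so the following is only a minimal sketch of the general technique (consistent hashing with virtual nodes), not the authors' code. The class name `HashRing`, the `vnodes` parameter, the node names, and the sample file names are all hypothetical and chosen for illustration. The sketch shows the property that makes a hash ring attractive for recaching: when a node is removed, only the files that node held are reassigned, and they are spread over the surviving nodes instead of landing on a single neighbor.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, hypothetical API)."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes   # virtual points per physical node (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> physical node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Place several virtual points per node so load spreads evenly.
        for i in range(self.vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        # Drop every virtual point of the failed node; other points stay put.
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        # A key belongs to the first point clockwise from its own position.
        if not self._points:
            raise LookupError("hash ring has no nodes")
        idx = bisect.bisect_right(self._points, ring_position(key))
        return self._owner[self._points[idx % len(self._points)]]


# Hypothetical setup: 8 compute nodes caching 10,000 training files.
nodes = [f"node{i}" for i in range(8)]
samples = [f"sample_{i}.bin" for i in range(10_000)]

ring = HashRing(nodes)
before = {s: ring.lookup(s) for s in samples}

# Simulate a failure of one node and recompute ownership.
ring.remove_node("node3")
after = {s: ring.lookup(s) for s in samples}

# Files whose owner changed are the ones that would need recaching.
moved = [s for s in samples if before[s] != after[s]]
recache_targets = Counter(after[s] for s in moved)
print(f"{len(moved)} files need recaching")
print("recache targets:", dict(recache_targets))
```

In this sketch, files that were cached on surviving nodes keep their owner, so only the failed node's share would have to be fetched again. How the paper's system actually places, detects, and refetches lost data is not stated in the abstract.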