Abstract
The growing demand for more powerful high-performance computing (HPC) systems has led to a steady rise in energy consumption by supercomputers worldwide. This study compares our Application-Topology Mapper (ATMapper) to the popular Simple Linux Utility for Resource Management (SLURM) in order to explore methods that can further optimize job scheduling within HPC systems. ATMapper is an artificial-intelligence-based approach to job scheduling that is currently being enhanced with quantum annealing (QA) to generate optimal schedules faster. We are applying QA to speed up the ATMapper process and achieve higher computing efficiency, thereby reducing HPC energy consumption. Here, we examine how four job-scheduling approaches perform processor node assignment on an example network architecture of 4 interconnected nodes. Using a specialized script, we assessed the schedule of a computation flow with 11 interdependent tasks. Data movements among nodes were tracked to count the number of interactions (network hops) between nodes needed to complete the tasks. The total number of hops and the job completion time were then used to quantify the efficiency of the different mapping approaches. In addition to SLURM, we also compare ATMapper to the QA-enabled LBNL TIGER and D-Wave Distributed Computing processor assignment approaches. Preliminary results show that our topology-aware, latency-adaptive ATMapper is significantly more efficient than the other scheduling approaches due to its load-imbalancing network allocation. The scheduler achieved a computing efficiency of 53% by performing significantly fewer network hops than its alternatives. By reducing the number of hops, ATMapper completed all 11 tasks using only 3 of the 4 given nodes. This research indicates the potential of QA/AI for HPC job scheduling.
In future work, we will test a SLURM simulator program to draw further comparisons of the effectiveness of ATMapper's scheduling approach. The results of this comparison will serve as a baseline for later improving SLURM's performance with a QA-enhanced ATMapper approach.
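The hop-counting metric used above to compare the mapping approaches can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's actual evaluation script: the 4-node ring topology, the dependency edges, and the task-to-node mapping are all made-up placeholders, and the score is simply the sum of shortest-path hop counts over every producer-consumer edge in the task graph.

```python
from collections import deque

# Hypothetical 4-node interconnect (a simple ring), given as an
# adjacency list. The actual topology studied is not specified here.
topology = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

def hops(src, dst):
    """Shortest-path hop count between two nodes via BFS."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in topology[node]:
            if nxt == dst:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")  # dst unreachable from src

# Illustrative dependency edges (producer task -> consumer task) and a
# candidate task-to-node mapping; both are placeholder examples.
deps = [(0, 1), (0, 2), (1, 3), (2, 3)]
mapping = {0: 0, 1: 0, 2: 1, 3: 0}

# Total hops: lower means less inter-node traffic for this mapping.
total_hops = sum(hops(mapping[a], mapping[b]) for a, b in deps)
```

Under this metric, a mapper that co-locates heavily communicating tasks on fewer nodes (as ATMapper does with 3 of the 4 nodes) drives `total_hops` down, at the cost of a deliberately imbalanced load.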