Scalable Multi-Facility Workflows for Artificial Intelligence Applications in Climate Research

by Takuya Kurihana, Tyler J Skluzacek, Rafael Ferreira Da Silva, Valentine G Anantharaj

Publication Type

Conference Paper

Book Title

SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Publication Date

January, 2025

Page Numbers

17 to 22

Publisher Location

New Jersey, United States of America

Conference Name

6th Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing Super Computing 24

Conference Location

Atlanta, Georgia, United States of America

Conference Sponsor

IEEE Computer Society, TCHPC, ACM, SIGHPC

Conference Date

Nov 17, 2024 - Nov 22, 2024

Abstract

Earth observation satellites and Earth system models are sources of vast, multi-modal datasets that are invaluable for advancing climate and environmental research. However, their scale and complexity pose significant challenges for processing and analysis. In this paper, we discuss our experiences in developing and using a scientific research application built on an automated multi-facility workflow that orchestrates data collection, preprocessing, artificial intelligence (AI) inferencing, and data movement across diverse computational resources, leveraging the Advanced Computing Ecosystem Testbed at the Oak Ridge Leadership Computing Facility (OLCF). We demonstrate that our workflow can be seamlessly integrated and orchestrated across research facilities managed by different federal agencies, thus allowing users to extract new scientific insights from climate datasets. The experimental results indicate that the multi-facility workflow significantly reduces processing time, enhances scalability, and maintains high efficiency across varying workloads. Notably, our workflow processes 12,000 high-resolution satellite images in just 44 seconds using 80 workers distributed across 10 nodes on the OLCF systems. Such high throughput is essential for dynamic tokenization and sharding of petascale satellite data for distributed AI model training and inferencing at scale across thousands of GPUs.
