Publication

Scalable Multi-Facility Workflows for Artificial Intelligence Applications in Climate Research

by Takuya Kurihana, Tyler J Skluzacek, Rafael Ferreira Da Silva, Valentine G Anantharaj
Publication Type
Conference Paper
Book Title
SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
Publication Date
Page Numbers
17 to 22
Publisher Location
New Jersey, United States of America
Conference Name
6th Annual Workshop on Extreme-Scale Experiment-in-the-Loop Computing, Supercomputing 24
Conference Location
Atlanta, Georgia, United States of America
Conference Sponsor
IEEE Computer Society, TCHPC, ACM, SIGHPC
Conference Date
-

Earth observation satellites and Earth system models are sources of vast, multi-modal datasets that are invaluable for advancing climate and environmental research. However, their scale and complexity pose significant challenges for processing and analysis. In this paper, we discuss our experiences in developing and using a scientific research application built on an automated multi-facility workflow that orchestrates data collection, preprocessing, artificial intelligence (AI) inferencing, and data movement across diverse computational resources, leveraging the Advanced Computing Ecosystem Testbed at the Oak Ridge Leadership Computing Facility (OLCF). We demonstrate that our workflow can be seamlessly integrated and orchestrated across research facilities managed by different federal agencies, thus allowing users to extract new scientific insights from climate datasets. The experimental results indicate that the multi-facility workflow significantly reduces processing time, enhances scalability, and maintains high efficiency across varying workloads. Notably, our workflow processes 12,000 high-resolution satellite images in just 44 seconds using 80 workers distributed across 10 nodes on the OLCF systems. Such high throughput is essential for dynamic tokenization and sharding of petascale satellite data for distributed AI model training and inferencing at scale across thousands of GPUs.
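To put the headline figure in perspective, the quoted numbers can be turned into rough throughput rates. The short sketch below only does arithmetic on the values stated in the abstract. The even split of images across workers and of workers across nodes is an assumption made for illustration, not something the abstract states.

```python
# Figures quoted in the abstract.
images = 12_000
seconds = 44
workers = 80
nodes = 10

# Aggregate rate across the whole workflow run.
throughput = images / seconds

# Assumption: images are divided evenly among workers.
per_worker = throughput / workers

# Assumption: workers are spread evenly over the nodes.
workers_per_node = workers // nodes

print(f"aggregate: {throughput:.1f} images/s")      # aggregate: 272.7 images/s
print(f"per worker: {per_worker:.2f} images/s")     # per worker: 3.41 images/s
print(f"workers per node: {workers_per_node}")      # workers per node: 8
```

Under these assumptions, the run corresponds to roughly 273 images per second in aggregate, or about 3.4 images per second per worker.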