Publication

An Evaluation of the Effect of Network Cost Optimization for Leadership Class Supercomputers

Publication Type
Conference Paper
Book Title
SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
Publication Date
Page Numbers
1 to 16
Issue
979-8-3503
Publisher Location
New Jersey, United States of America
Conference Name
SC24: International Conference for High Performance Computing, Networking, Storage and Analysis
Conference Location
Atlanta, Georgia, United States of America
Conference Sponsor
Conference Date
-

Dragonfly networks are widely deployed in large-scale high-performance computing because of their cost-effectiveness and efficiency. The US will soon have three exascale supercomputers for leadership-class workloads deployed using dragonfly networks. Compared to indirect networks of similar scale, the dragonfly network has considerably reduced cable lengths, cable counts, and switch counts, resulting in significant network cost savings for a given system size; however, these cost reductions result in fewer global minimal paths and more challenging routing. Additionally, large-scale dragonfly networks often require a taper at the global link level, resulting in less bisection bandwidth than is achievable in traditional non-blocking topologies of equivalent scale. While dragonfly networks have been extensively studied, they have yet to be fully evaluated in an extreme-scale (i.e., exascale) system that targets capability workloads. In this paper, we present the results of the first large-scale evaluation of a dragonfly network on an exascale system (Frontier) and compare its behavior to a similar-scale fat-tree network on a previous-generation TOP500 system (Summit). This evaluation aims to determine the effect of network cost optimizations by measuring a tapered topology’s impact on capability workloads. Our evaluation is based on a collection of synthetic microbenchmarks, mini-apps, and full-scale applications. It compares the scaling efficiencies of each benchmark between the dragonfly-based Frontier and the fat-tree-based Summit systems. Our results show that a dragonfly network is $\sim 30\%$ more cost efficient than a fat-tree topology, which amortizes to $\sim 3\%$ of an exascale system cost. Furthermore, while tapered dragonfly networks impose significant tradeoffs, the impacts are not as broad as initially thought and are mostly seen in applications with global communication patterns, particularly all-to-all (e.g., FFT-based algorithms), but also in applications with local communication patterns (e.g., nearest-neighbor algorithms) that are sensitive to network performance variability.