Understanding the Impact of Interconnect Failures on System Operation...

by Matthew A Ezell

Publication Type

Conference Paper

Publication Date

May, 2013

Conference Name

Cray User Group

Conference Location

Napa Valley, California, United States of America

Conference Date

May 6, 2013 - May 9, 2013

Abstract

Hardware failures are inevitable on large high-performance computing systems. Faults or performance degradations in the high-speed network can reduce the entire system’s performance. Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults that were fatal in previous-generation systems. These network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. Oak Ridge National Laboratory has developed a set of user-level diagnostics that stress the high-speed network and search for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC’s memory-mapped registers (MMRs) are used to better understand the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand the impact of network problems on user codes.
