Abstract
Hardware failures are inevitable on large high-performance computing systems. Faults or performance degradations in the high-speed network can reduce the performance of the entire system. Since the introduction of the Gemini interconnect, Cray systems have become resilient to many networking faults that were fatal in previous-generation systems. These network reliability and resiliency features have enabled higher uptimes on Cray systems by allowing them to continue running with reduced network performance. A set of user-level diagnostics has been developed that stresses the high-speed network and searches for components that are not performing as expected. Nearest-neighbor bandwidth tests check every network chip and network link in the system. Additionally, performance counters stored in the network ASIC’s memory-mapped registers (MMRs) are used to better understand the state of the network. Applications have also been characterized under various suboptimal network conditions to better understand the impact of network problems on user codes.