Using Performance Tools to Support Experiments in HPC Resilience...

by Thomas J Naughton III, Swen Boehm, Christian Engelmann, Geoffroy R Vallee

Publication Type

Conference Paper

Book Title

In Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids

Publication Date

April 2014

Page Numbers

727 to 736

Volume

8374

Conference Name

European Conference on Parallel and Distributed Computing (Euro-Par)

Conference Location

Aachen, Germany

Conference Date

Aug 26, 2013 - Aug 30, 2013

Abstract

The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between “performance tools” and “resilience tools”. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community.
In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnostic procedures help provide context about the system at the time the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.
