Abstract
Proposed exascale systems will present a number of
considerable resiliency challenges. In particular, DRAM
soft errors, or bit flips, are expected to increase greatly
because of the higher memory density of these systems.
Current hardware-based fault-tolerance methods will be
unable to cope with the expected soft error rate.
As a result, additional software-level protection will be needed to
address this challenge. In this paper we introduce LIBSDC,
a tunable, transparent silent data corruption detection and
correction library for HPC applications. LIBSDC provides
comprehensive SDC protection for program memory by
implementing on-demand page integrity verification.
Experimental benchmarks with Mantevo HPCCG show that, once
tuned, LIBSDC achieves SDC protection with 50\% resource
overhead, less than the 100\% required by dual modular
redundancy.