Fault Tolerance for High-Performance Computing Clusters and Applications

Overview

Summary

As today’s distributed commercial and scientific applications increase in complexity and scale, providing fault-tolerance capabilities becomes increasingly difficult. Faults can arise from multiple sources, such as software bugs, hardware errors, and unexpected runtime conditions, and can affect an application in different phases of its execution. The growing size of the largest supercomputers and data centers on which these applications run challenges fault-tolerance techniques such as checkpointing and fault detection and localization. On one hand, these techniques must provide fault tolerance in a scalable manner: they cannot become a bottleneck as the number of processes and the amount of input data grow. On the other hand, their overhead must be small enough that applying them still reduces, rather than increases, the end-to-end completion time of user applications.

Achieved Technical Goals

Reliable/Scalable Checkpointing

A Fine-Grained Cycle Sharing (FGCS) system aims at utilizing the large amount of idle computational resources available on the Internet. In such a cycle-sharing system, PC owners voluntarily make their CPU cycles available as part of a shared computing environment, but only if they incur no significant inconvenience from letting a foreign job (guest process) run on their own machines. To exploit available idle cycles under this restriction, an FGCS system allows a guest process to run concurrently with the jobs belonging to the machine owner (host processes). However, for guest users, these free computational resources come at the cost of fluctuating availability: the host may suffer a software or hardware failure, its workload may rise beyond a threshold, or its owner may simply return to the machine.

To achieve high performance in the presence of resource volatility, checkpointing and rollback have been widely applied. These techniques enable an application to periodically save a snapshot of its state to stable storage. A job may be evicted from its execution machine at any time and can recover from such a failure by rolling back to the latest checkpoint.
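To make the rollback idea concrete, the minimal Python sketch below periodically saves the application’s state to stable storage and resumes from the latest checkpoint after an eviction. It is only an illustration under assumed names: the checkpoint path, interval, and compute_step function are hypothetical and are not part of Condor’s or Falcon’s interfaces.

import os
import pickle
import time

CKPT_PATH = "app_state.ckpt"   # hypothetical checkpoint file on stable storage
CKPT_INTERVAL = 60             # seconds between checkpoints (illustrative value)

def save_checkpoint(state):
    # Write to a temporary file first, then rename, so that a crash mid-write
    # never corrupts the last good checkpoint.
    tmp = CKPT_PATH + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_PATH)

def load_checkpoint():
    # Roll back to the latest checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"iteration": 0, "result": 0.0}

def compute_step(i):
    # Stand-in for one unit of application work.
    return 1.0 / (i + 1)

def run():
    state = load_checkpoint()
    last_ckpt = time.time()
    while state["iteration"] < 1_000_000:
        state["result"] += compute_step(state["iteration"])
        state["iteration"] += 1
        if time.time() - last_ckpt >= CKPT_INTERVAL:
            save_checkpoint(state)
            last_ckpt = time.time()

if __name__ == "__main__":
    run()

If the guest process is evicted and later restarted elsewhere, load_checkpoint restores the last saved iteration, so only the work done since the previous checkpoint is lost.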

Most production FGCS systems, such as Condor, store checkpoints on dedicated storage servers. This solution works well when a cluster belongs to a single small administrative domain or when a large number of storage servers is available. However, it does not scale to grids whose participants number in the thousands and span home users, geographically separated university campuses, and research labs. We propose Falcon, a distributed checkpoint/recovery framework that utilizes free storage space available on the grid resources themselves instead of relying on dedicated checkpoint servers.
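Falcon’s actual protocol is not reproduced here, but the following sketch illustrates the general idea under assumed inputs: a serialized checkpoint is split into fragments that are replicated across ordinary grid hosts, ranked by availability and free space, rather than shipped to a dedicated server. The candidate list, the ranking heuristic, and the send_fragment step are hypothetical placeholders.

def split_into_fragments(data, n):
    # Split a serialized checkpoint into n roughly equal byte fragments.
    size = (len(data) + n - 1) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

def choose_storage_hosts(candidates, k):
    # candidates: list of (hostname, free_bytes, availability_score) tuples.
    # Prefer hosts with higher availability, breaking ties by free space.
    ranked = sorted(candidates, key=lambda c: (c[2], c[1]), reverse=True)
    return [host for host, _, _ in ranked[:k]]

def store_checkpoint(data, candidates, fragments=4, replicas=2):
    # Scatter replicated fragments across non-dedicated grid hosts and
    # remember where each fragment went so recovery can fetch it later.
    parts = split_into_fragments(data, fragments)
    hosts = choose_storage_hosts(candidates, fragments + replicas)
    placement = {}
    for i, part in enumerate(parts):
        targets = [hosts[(i + r) % len(hosts)] for r in range(replicas)]
        placement[i] = targets
        # send_fragment(target, part) would transfer the bytes in a real system.
    return placement

Replicating each fragment on more than one host lets recovery proceed even when some of the volunteered machines have themselves become unavailable.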

We have evaluated Falcon on a production Condor pool, DiaGrid, with multiple biomedical benchmark applications. Our experiments demonstrate that application performance with Falcon improves by 11% to 44%, depending on the checkpoint size and on whether the storage server used by Condor’s solution was located close to the compute host, and that Falcon’s performance scales as the checkpoint sizes of different scientific applications increase.

Scalable Fault Detection and Diagnosis

We have developed AutomaDeD (automata-based debugging of dissimilar tasks), a fault-detection framework that allows developers to focus their debugging effort on the period of time, parallel tasks, and code regions that are affected by faults in MPI applications.

Through non-intrusive runtime monitoring, AutomaDeD models the behavior of parallel tasks using a semi-Markov model (SMM). States in an SMM represent communication code regions, i.e., MPI communication routines, and computation code regions, i.e., code executed between two MPI communication routines. A state consists of call stack information such as the module name and offset of the function calls that are currently active in the MPI process. An edge represents a transition between two states in the program. Two attributes are assigned to edges: a transition probability that captures the frequency of occurrence of the transition, and a time probability distribution that allows us to model the time spent in the source state conditioned on the destination state.
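One plausible way to represent such a per-task semi-Markov model is sketched below in Python. The class name, the state labels, and the idea of feeding it from a wrapper around MPI calls are illustrative assumptions; they are not AutomaDeD’s actual data structures, which are built from call-stack information gathered through runtime monitoring.

import time
from collections import defaultdict

class SemiMarkovModel:
    # Per-task model: states are code regions (MPI communication routines or the
    # computation between them); edges carry a transition count and the observed
    # times spent in the source state before moving to the destination state.
    def __init__(self):
        self.edge_count = defaultdict(int)    # (src_state, dst_state) -> occurrences
        self.edge_times = defaultdict(list)   # (src_state, dst_state) -> time-in-source samples
        self.current_state = "start"
        self.entered_at = time.time()

    def transition(self, new_state):
        now = time.time()
        edge = (self.current_state, new_state)
        self.edge_count[edge] += 1
        self.edge_times[edge].append(now - self.entered_at)
        self.current_state, self.entered_at = new_state, now

    def transition_probability(self, src, dst):
        total = sum(c for (s, _), c in self.edge_count.items() if s == src)
        return self.edge_count.get((src, dst), 0) / total if total else 0.0

# In a real monitor the MPI routines would be intercepted (for example through
# the PMPI profiling layer); here hypothetical labels combining a routine name
# and call site stand in for the call-stack information described above.
smm = SemiMarkovModel()
smm.transition("compute@solver.c:120")        # computation region between two MPI calls
smm.transition("MPI_Allreduce@solver.c:131")  # communication region

The transition counts and the samples in edge_times together give the two edge attributes described above: how often a transition occurs and how long the task typically spends in the source state before taking it.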

The application’s execution is divided into a series of time periods called phases, in which the application repeatedly exhibits the same execution pattern. First, AutomaDeD performs clustering of SMMs to find the phase in which the error first manifests. Then, AutomaDeD seeks the anomalous set of tasks and states (or code regions) in which a fault is manifested. It does this by finding similarities between tasks using scalable clustering or nearest-neighbor techniques. The idea behind this approach is that the tasks of an MPI application can often be grouped based on their behavior (as captured by the computation and communication states of our SMMs). The natural number of groups in an application can be provided by the developer (e.g., a master-slave application has two clusters or groups), taken from clusters of previous phases in the same run (before a fault is manifested), or inferred from traces of previous normal runs. The erroneous tasks are often those that deviate from the normal number of clusters, for example, by forming a separate cluster with few elements.
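As a rough, self-contained illustration of the nearest-neighbor flavor of this step (not AutomaDeD’s actual scalable clustering implementation), the sketch below flattens each task’s transition probabilities into a vector and flags the tasks whose distance to their nearest neighbor is unusually large. The input format and the MAD-based cutoff are assumptions made for the example.

import numpy as np

def outlier_tasks(task_edge_probs, threshold=3.0):
    # task_edge_probs: one dict per MPI task mapping (src_state, dst_state) to a
    # transition probability. Returns the ranks of tasks whose models sit far
    # from every other task's model.
    edges = sorted({e for probs in task_edge_probs for e in probs})
    vectors = np.array([[probs.get(e, 0.0) for e in edges] for probs in task_edge_probs])
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nn = dists.min(axis=1)                      # distance to the nearest-neighbor task
    med = np.median(nn)
    mad = np.median(np.abs(nn - med)) + 1e-12   # robust estimate of the typical spread
    return [rank for rank, d in enumerate(nn) if d > med + threshold * mad]

In a master-slave code, for instance, the task vectors would form two dense neighborhoods, while a hung or faulty task would stand apart with a large nearest-neighbor distance and be reported for inspection.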

AutomaDeD has been used to detect and diagnose difficult-to-detect real-world bugs that appear in large-scale scientific applications. We have measured its fault coverage by injecting a broad spectrum of common faults into the NAS Parallel Benchmarks. AutomaDeD’s fault detection and localization analysis scales, and causes minimal application slowdown, on a Linux cluster at Lawrence Livermore National Laboratory with up to 5K processes and on an IBM BlueGene/P system with up to 100K processes.

Students

  • Ignacio Laguna, ilaguna AT purdue DOT edu
  • Bowen Zhou, bzhou AT purdue DOT edu
  • Tanzima Zerin Islam, tislam AT purdue DOT edu

Funding Source

This work has been supported by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. We are also grateful to GrammaTech for a license for CodeSurfer, its software for backward slicing.

Last modified: March 18, 2015

Download Software