AutomaDeD: Debugging Large-Scale Parallel Programs

Started: 2009 Ended: Ongoing

Contributors: Ignacio Laguna, Subrata Mitra, Saurabh Bagchi (Purdue)

Todd Gamblin, Martin Schulz, Dong Ahn, Bronis de Supinski, Greg Bronevetsky (LLNL)

 

Current techniques for resilience are insufficient for exascale systems (i.e., systems capable of executing 10^18 floating-point operations per second), and unless radical changes are made across the entire software stack, exascale systems may never compute reliable scientific results. Current resilience techniques will also be too costly at exascale. Approaches based on checkpoint/restart require enough I/O bandwidth to record checkpoints faster than faults occur, but I/O bandwidth is not expected to keep pace with processing power, and the gap between the two is already widening. Pure checkpointing approaches will thus spend all of their time in I/O instead of performing useful computation. Replication-based approaches have promise, but blind replication of all tasks will, at best, halve the available performance of exascale machines, wasting CPU cycles and energy on redundant work.

A targeted approach is needed that allows exascale runtime systems to isolate the regions where faults occur and replicate only those parts of the system. To enable this, we need runtime systems that monitor and analyze their own behavior to determine when to take preventive action. This type of analysis has been investigated before, but existing approaches aggregate system-wide data at a central point for analysis, which is unscalable and time-consuming. Further, existing analyses assume that parallel application behavior is relatively homogeneous across nodes and tasks. Such approaches will be ill-equipped to cope with the pervasive adaptive behavior of exascale applications and the heterogeneity of exascale platforms.

In our work, we developed AutomaDeD, a tool that detects errors based on runtime information about the control paths that the parallel application follows and the time spent in each control block. AutomaDeD suggests possible root causes of detected errors by pinpointing, in a probabilistic, rank-ordered manner, the erroneous process and the code region in which the error arose. Intuitively, the erroneous tasks often form a small minority of the full set of tasks; hence, they appear as outliers when we cluster tasks based on their control-flow and timing features. Further, in the time dimension, execution during the first few iterations is more likely to be correct than during later iterations, which we also leverage to assign correct or erroneous labels; otherwise, we make use of labeled correct runs, if available.
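The sketch below illustrates this clustering intuition at a small scale. It is not the AutomaDeD implementation itself (which models each task's MPI and computation behavior with per-task statistical models and compares those models), and the names flag_outlier_tasks, task_features, and minority_fraction are hypothetical, chosen only for illustration. Each task is reduced to a feature vector of per-block timings, the tasks are clustered, and tasks falling into unusually small clusters are reported as suspects.

# Minimal sketch of the cluster-and-flag-outliers idea, assuming each task
# has already been summarized as a feature vector of control-flow/timing data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def flag_outlier_tasks(task_features, n_clusters=2, minority_fraction=0.1):
    """Cluster per-task feature vectors and flag tasks in small clusters.

    task_features: (num_tasks, num_features) array; row i holds task i's
    per-control-block timing features (an illustrative assumption).
    """
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(task_features)
    suspects = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Erroneous tasks are expected to form a small minority, so tasks in
        # small clusters are ranked as the most suspicious.
        if len(members) <= minority_fraction * len(task_features):
            suspects.extend(members.tolist())
    return sorted(suspects)

# Example: 16 tasks, 4 features; task 7 spends anomalously long in one block.
rng = np.random.default_rng(0)
feats = rng.normal(1.0, 0.05, size=(16, 4))
feats[7, 2] += 5.0                 # injected timing anomaly
print(flag_outlier_tasks(feats))   # expected output: [7]

In the real tool, the comparison is done over per-task behavioral models rather than raw feature vectors, and the ranking is probabilistic, but the outlier-ranking principle is the same.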

AutomaDeD is being used internally at Lawrence Livermore National Laboratory (LLNL) to debug large-scale programs, in conjunction with an existing tool called STAT. It has already helped find some hard-to-find bugs, such as one in an IBM MPI library that only manifested at more than 4,096 processes on the BlueGene machines. One of the leading student developers of AutomaDeD, Ignacio Laguna, joined LLNL as a full-time employee.

References

[1]   Subrata Mitra, Ignacio Laguna, Saurabh Bagchi, Dong H. Ahn, Martin Schulz, Todd Gamblin, “Scalable Parallel Debugging via Loop-Aware Progress Dependence Analysis,” Accepted to appear as a poster at the Supercomputing Conference, Denver, CO, November 17-22, 2013.

[2]   Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi, “WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales,” At the 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC), pp. 1-12, New York City, New York, June 17-21, 2013.

[3]   Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, and Todd Gamblin, “Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications,” At the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 213-222, September 19-23, 2012, Minneapolis, MN.

[4]   Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, and Bronis R. de Supinski, “Automatic Fault Characterization via Abnormality-Enhanced Classification,” At the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-12, June 25-28, 2012, Boston, MA.

[5]   Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, and Barry Rountree, “Large Scale Debugging of Parallel Tasks with AutomaDeD,” At the IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing), pp. 1-12, Seattle, Washington, November 12-18, 2011.

[6]   Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz, “AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks,” At the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 231-240, June 28-July 1, 2010, Chicago, IL.