Dependable Computing Systems Laboratory




Fault Tolerance for High-Performance Computing Clusters and Applications
Summary
As today's distributed commercial and scientific applications grow in complexity and scale, providing fault tolerance becomes increasingly difficult. Faults can arise from multiple sources, such as software bugs, hardware errors, and unexpected runtime conditions, and can affect an application in different phases of its execution. The growing size of the largest supercomputers and data centers on which these applications run poses challenges for fault-tolerance techniques such as checkpointing, fault detection, and fault localization. On one hand, these techniques must scale: they cannot become a bottleneck as the number of processes and the volume of input data increase. On the other hand, the overhead they add should be small enough that using them ultimately reduces the end-to-end completion time of user applications.

465 Northwestern Avenue, West Lafayette, IN 47907 | dcsl@ecn.purdue.edu | +1 765 494 3510