Fault Tolerance for High-Performance Computing Clusters and Applications
Summary

As today's distributed commercial and scientific applications increase in complexity and scale, providing fault tolerance becomes increasingly difficult. Faults can arise from multiple sources, such as software bugs, hardware errors, and unexpected runtime conditions, and can affect an application in different phases of its execution. The growing size of the largest supercomputers and data centers on which these applications run poses challenges for fault-tolerance techniques such as checkpointing and fault detection and localization. On one hand, these techniques need to scale: they cannot become a bottleneck as the number of processes and the volume of input data increase. On the other hand, their added overhead should be small enough that fault tolerance ultimately reduces, rather than increases, the end-to-end completion time of user applications.

 

 

 
 
465 Northwestern Avenue, West Lafayette, IN 47907   |  dcsl@ecn.purdue.edu   |  +1 765 494 3510


Last Update: March 19, 2012 12:15 by GMHoward