ECE 69500 - Design of Fault-Tolerant Computer Systems

Spring 2009 CRN 17326

Course Details

Lecture Hours: 3
Credits: 3

Experimental Course Offered:

Spring 2009

Catalog Description:

An introduction to the hardware and software methodologies for specifying, modeling, and designing fault-tolerant systems, supported by case studies of real systems. The material presents a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how the hardware and software techniques interact, which techniques can be provided in COTS hardware, which can be embedded into the operating system and network communication layers, and which can be provided via a distributed software layer and in the application itself. The course stresses the application of these concepts through a semester-long course project focusing on an unsolved problem in the field. The course also uses modeling software to model realistic systems and evaluate their dependability properties.

Required Text(s):

  1. Reliable Computer Systems: Design and Evaluation, 3rd Edition, D.P. Siewiorek and R.S. Swarz, A.K. Peters, 1999, ISBN 1-56881-092-X

Recommended Text(s):

  1. Fault-Tolerant Computer System Design, 1st Edition, D.K. Pradhan, Prentice-Hall, 1996
  2. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 2nd Edition, K. Trivedi, John Wiley & Sons, 2001

Lecture Outline:

Lectures  Topics
2         Introduction: motivation; system view of high-availability design; two real system examples (Stratus and Chameleon)
4         Statistical distributions and their use for reliability modeling
3         Basic approaches to hardware redundancy
4         Error detection techniques: watchdog processors, heartbeats, consistency and capability checking, data audits, assertions, control-flow checking
5         Software fault tolerance: process pairs, robust data structures, N-version programming, recovery blocks, replica consistency and reintegration, multithreaded programs
6         Network fault tolerance: reliable communication protocols, agreement protocols, database commit protocols
4         Practical steps in the design of high-availability networked systems. Examples: Web services; highly available clusters (à la Google File System)
4         Checkpointing and recovery
4         Experimental evaluation: modeling- and simulation-based; fault-injection-based. Modeling tools: SHARPE, UltraSAN
3         Practical systems for fault tolerance: putting it all together. Applications: electronic banking; NASA Remote Exploration & Experimentation System
3         Discussion of projects
2         Presentation of projects
1         Mid-term examination

Assessment Method:

none