ECE 57200 - Fault Tolerant Computer Systems

Course Details

Lecture Hours: 3; Credits: 3

Areas of Specialization:

  • Education

Counts as:

  • EE Elective

Normally Offered:

Spring - odd years

Requisites:

ECE 30200 and 36800

Catalog Description:

An introduction to the hardware and software methodologies for specifying, modeling and designing fault-tolerant systems supported by case studies of real systems. The material presents a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how the hardware and software techniques interplay, what techniques can be provided in COTS hardware, what can be embedded into operating system and network communication layers, and what can be provided via a distributed software layer and in the application itself. The course uses modeling software to model realistic systems and evaluate their dependability properties.

Required Text(s):

  1. Reliable Computer Systems - Design and Evaluation, 3rd Edition, D.P. Siewiorek and R.S. Swarz, A.K. Peters, Limited, 1999

Recommended Text(s):

  1. Fault Tolerant Computer System Design, 1st Edition, D.K. Pradhan, ed., Prentice-Hall, 1996
  2. Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd Edition, K. Trivedi, John Wiley & Sons, 2001

Learning Outcomes

A student who successfully fulfills the course requirements will have demonstrated:

  • an ability to evaluate the dependability of a system
  • an ability to analyze a system for performance-dependability tradeoffs
  • an ability to select the appropriate detection techniques (hardware and software) for a given environment
  • an ability to select the appropriate recovery techniques (hardware and software) for a given environment
  • an ability to select the appropriate points in an end-to-end system to embed fault-tolerant techniques

Lecture Outline:

Lectures  Topics
2         Introduction: Motivation, System view of high-availability design, Two commercial examples (Stratus and Chameleon)
3         Probability review, distributions
3         Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Fault-tolerant interconnection networks
5         Error detection techniques: Watchdog processors, Heartbeats, Consistency and capability checking, Data audits, Assertions, Control-flow checking
5         Software fault tolerance: Process pairs, Robust data structures, N-version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs
6         Network fault tolerance: Reliable communication protocols, Agreement protocols, Database commit protocols
3         Practical steps in design of high-availability networked systems
5         Checkpointing and recovery
4         Experimental evaluation: Modeling- and simulation-based, Fault-injection-based
3         Practical systems for fault tolerance: Putting it all together
3         Discussion of projects
2         Presentation of projects
1         Tests

Assessment Method:

Achievement of the course outcomes will be assessed through a midterm exam, a final exam, and a graded design and implementation project. Students will work in groups of two, and each group will choose a project from a list. Each project will focus on one aspect of fault-tolerant system design and will test the ability to design, model, or implement; execute experiments; and perform evaluation.