ECE 60872 - Fault-Tolerant Computer System Design

Lecture Hours: 3 Credits: 3

Area(s) of Specialization:

Counts as:

Normally Offered: Each Fall


Requisites by Topic:
Experience with any one high-level programming language; a basic background in probability

Catalog Description:
This course introduces the hardware and software methodologies for specifying, modeling, and designing fault-tolerant systems, supported by case studies of real systems. The material covers a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how hardware and software techniques interact, which techniques can be provided in commercial off-the-shelf (COTS) hardware, which can be embedded in the operating system and network communication layers, and which can be provided by a distributed software layer or by the application itself. The course emphasizes hands-on learning through the design and development of systems in the course project, which accounts for 50% of the course grade. We will use a modeling tool called UltraSAN to model a realistic system and then learn how the model can be solved to evaluate various dependability properties of the system.

Supplementary Information:
I accompany each topic with examples of how it has been used in real-world systems. I expect to have at least one invited speaker address the class on how they have designed and used fault-tolerant systems. An important focus of the class is the semester-long project on a cutting-edge problem. I expect that, with some additional work after the semester, the successful projects will lead to publications, as a majority of the projects in past offerings have.

Required Text(s):
  1. Fault-Tolerant Systems, Israel Koren and C. Mani Krishna, Morgan Kaufmann, 2007, ISBN 9780120885251.
Recommended Text(s):
  1. Fault Tolerance in Distributed Systems, Pankaj Jalote, Prentice Hall, 1994.
  2. Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd Edition, Kishor Trivedi, John Wiley & Sons, 2001.

Lecture Outline:

Lectures Major Topics
2 Introduction: Motivation, System view of high availability design, Terminology
6 Stochastic analysis of reliability: discrete distributions, continuous distributions
3 Hardware redundancy: basic approaches, static and dynamic, voting, coding for detection and recovery. Application: SEC-DED codes
3 Error detection techniques: watchdog processors, heartbeats, consistency and capability checking, data audits, assertions, control-flow checking. Application: Dynamic Host Configuration Protocol (DHCP)
5 Software fault tolerance: process pairs, robust data structures, N-version programming (NVP), recovery blocks (RB), replica consistency and reintegration, multithreaded programs. Application: Quantitative evaluation of NVP and RB
2 Secure coding practices: principles and practice. Application: Coding examples
6 Network fault tolerance: reliable communication protocols, Byzantine fault tolerance. Application: Bitcoin
4 Modeling. Applications: UltraSAN, SHARPE
4 Checkpointing and recovery. Application: Microcheckpointing
2 Experimental evaluation: simulation and fault-injection based
2 Practical systems for fault tolerance: putting it all together. Applications: Amazon Web Services, Hadoop
2 Industry presentations
2 Discussion of projects
1 Tests