Course Outline
ECE 60872/CS 59000 – Fault-Tolerant Computer System Design
Fall 2019
The course provides an introduction to the hardware
and software methodologies for specifying, modeling and designing
fault-tolerant systems, supported by case studies of real systems. The material
presents a broad spectrum of hardware and software error detection and recovery
techniques that can be used to build reliable networked systems. The lectures discuss how hardware and
software techniques interact, which techniques can be provided in COTS
hardware, which can be embedded into operating system and network communication
layers, and which can be provided via a distributed software layer and in the
application itself. The course also covers the emerging role of data analytics in
building and operating reliable systems.
The course focuses on hands-on learning through the design and
development of innovative systems in the course project, which accounts for 50% of
the course grade.
This is reinforced by two lectures given by
practitioners from industry, who share their experience and insights in
building dependable systems.
Note: This is not an
advanced graduate-level course. Any student with a strong undergraduate CS or
ECE background, i.e., one who can program in at least one high-level
programming language and has a basic knowledge of probability, can take the
class.
New this year
We will devote multiple
lectures to the emerging topic of big data for reliability and security. These
will cover some of the fundamental algorithms and two large use cases with
real-world data.
Course Structure:
Class project:
Each team of two or three students will work on one of several research projects.
Each project will focus on one aspect
of fault-tolerant system design and will test the team's ability to design, model or
implement, run experiments, and evaluate the results. The goal is to produce
work that can be submitted to a conference, as has happened with
many past projects. As graduate students, you will find that building your
publication record is in the best interest of your career.
The project has the following phases, each with a
tentative date.
List of suggested projects made available | September 3 |
Project teams formed, discussion of project ideas with instructor | September 3-10 |
Project proposals submitted | September 12 |
Interim project presentations (15 minutes each group) | October 10, 12 |
Preliminary project report | October 15 |
Final project presentations | Last two days of class |
Final project report | Last day of semester |
Lecture Outline
This is the tentative outline of topics covered in the class. The new lectures this
year are in red.
Introduction: Motivation, System view of high availability design, Terminology | 2 |
Stochastic analysis of reliability: Discrete distributions, Continuous distributions | 6 |
Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Coding for detection and recovery; Application: SEC-DED codes | 3 |
Error detection and correction techniques: Watchdog processors, Heartbeats, Consistency and capability checking, Data audits, Assertions, Control-flow checking; Application: Erasure-coded storage | 3 |
Software fault tolerance: Process pairs, Robust data structures, N version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs; Application: Quantitative evaluation of NVP and RB | 3 |
Secure coding practices: Principles and practice; Application: Coding examples | 2 |
Network fault tolerance: Reliable communication protocols, Agreement protocols, Byzantine fault tolerance; Application: Bitcoin | 5 |
Big data for reliability; Application: Failure analysis of Purdue compute clusters | 2 |
Big data for security; Application: ML analysis of ransomware | 2 |
Modeling; Application: UltraSAN, Sharpe | 2 |
Checkpointing & Recovery; Application: SCR checkpointing system for DOE supercomputers | 3 |
Experimental Evaluation: Simulation and Fault-injection based | 2 |
Practical Systems for Fault Tolerance: Putting it all together; Application: Amazon Web Service; Application: Hadoop | 2 |
Industry presentations | 2 |
Discussion of projects | 2 |
Project presentations | 2 |
Tests | 1 |
Total | 44 |