ECE 60872 - Reliable and Secure Computer Systems

Note:

I accompany each topic with examples of how it has been used in real-world systems. I expect at least one invited speaker to address the class on how he or she has designed and used fault-tolerant systems. An important focus of the class is the semester-long project on a cutting-edge problem. I expect that, with some additional work after the semester, the successful projects will lead to publications, as they have for a majority of the projects in past offerings.

Course Details

Lecture Hours: 3 Credits: 3

Areas of Specialization:

  • Computer Engineering

Counts as:

Normally Offered:

Each Spring

Campus/Online:

On-campus only

Requisites:

None

Requisites by Topic:

Experience with any one high-level programming language; a basic background in probability

Catalog Description:

The course provides an introduction to the hardware and software methodologies for specifying, modeling and designing fault-tolerant systems supported by case studies of real systems. The material presents a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how the hardware and software techniques interplay, what techniques can be provided in COTS hardware, what can be embedded into operating system and network communication layers, and what can be provided via a distributed software layer and in the application itself.

Required Text(s):

  1. Fault-Tolerant Systems, Israel Koren and C. Mani Krishna, Morgan Kaufmann, 2007, ISBN 9780120885251

Recommended Text(s):

  1. Fault Tolerance in Distributed Systems, Pankaj Jalote, Prentice Hall, 1994
  2. Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd Edition, Kishor Trivedi, John Wiley & Sons, 2001

Learning Outcomes

A student who successfully fulfills the course requirements will have demonstrated an ability to:

  • Formulate design principles for dependable systems.
  • Analyze mathematically a system for its dependability properties.
  • Understand and apply software dependability techniques.
  • Understand and apply networking dependability techniques.
  • Design, develop, and present an innovative project on dependable systems.

Lecture Outline:

Lectures Major Topics
2 Introduction: Motivation, System view of high availability design, Terminology
6 Stochastic analysis of reliability: discrete distributions, continuous distributions
3 Hardware redundancy: basic approaches, static and dynamic, voting, coding for detection and recovery. Application: SEC-DED codes
3 Error detection techniques: watchdog processors, heartbeats, consistency and capability checking, data audits, assertions, control-flow checking. Application: Dynamic Host Configuration Protocol (DHCP)
5 Software fault tolerance: process pairs, robust data structures, N-version programming (NVP), recovery blocks (RB), replica consistency and reintegration, multithreaded programs. Application: Quantitative evaluation of NVP and RB
2 Secure coding practices: principles and practice. Application: Coding examples
6 Network fault tolerance: reliable communication protocols, Byzantine fault tolerance. Application: Bitcoin
4 Modeling. Application: UltraSAN, SHARPE
4 Checkpointing and recovery. Application: Microcheckpointing
2 Experimental evaluation: simulation and fault-injection based
2 Practical systems for fault tolerance: putting it all together. Applications: Amazon Web Services, Hadoop
2 Industry presentations
2 Discussion of projects
1 Tests