Course Information Handout
ECE 60872/CS 590 – Fault-Tolerant Computer System Design
Fall 2017
The course provides an introduction to the hardware
and software methodologies for specifying, modeling, and designing
fault-tolerant systems, supported by case studies of real systems. The material
presents a broad spectrum of hardware and software error detection and recovery
techniques that can be used to build reliable networked systems. The lectures discuss how the hardware and
software techniques interact, which techniques can be provided in commercial off-the-shelf (COTS)
hardware, which can be embedded into the operating system and network communication
layers, and which can be provided via a distributed software layer and in the
application itself.
The course focuses on hands-on learning through the design
and development of innovative systems in the course project, which accounts for 50%
of the course grade. This is reinforced by two guest lectures given
by industry practitioners who share their experiences and insights in
building dependable systems.
We use a modeling tool called UltraSAN to model
a realistic system and solve the model to evaluate various dependability
properties of the system.
Note: This is not an advanced graduate-level
course. Any student with a strong undergraduate CS or ECE background, i.e., one
who is able to program in at least one high-level programming language and has
a basic knowledge of probability, can take the class.
CRN: 20544 (to be used by ECE students
for registering for the class); 21051 (to be used by CS students for
registering for the class)
Class hours: Monday, Wednesday, and Friday, 10:30-11:20 am, EE 226
Instructor: Prof. Saurabh Bagchi,
Professor, School of Electrical and Computer Engineering (ECE) and Department
of Computer Science (CS). In addition, there will be 2 guest lectures by
practitioners from the industry.
Office, Phone, Email: EE 329, 765-494-3362 (Office), sbagchi@purdue.edu
Office hours: Tuesday and Friday 3-4
Administrative Assistant: Mary-Ann Satterfield, msaterfi@purdue.edu, EE 326B, 494-6389
Graduate Course Assistants: Ran Xu (xu943@purdue.edu) and Christopher Wright (wrigh338@purdue.edu)
They are available to help with conceptual questions on the topics covered in the
class, as well as programming questions on the programming assignments and projects.
However, they are not available to code for you (you wish!).
URL: https://engineering.purdue.edu/ee695b/
Textbook: None.
Reference Books: (No need to buy, since only parts of each will be used and I will provide photocopies of the relevant portions.)
1. I. Koren and C. Mani Krishna, Fault-Tolerant Systems, 1st edition, 2007, Morgan Kaufmann.
2. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems - Design and Evaluation, 3rd edition, 1998, A.K. Peters, Limited.
3. D. K. Pradhan, ed., Fault Tolerant Computer System Design, 1st edition, 1996, Prentice-Hall.
4. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.
Apart from these, the course will use technical
conference and journal papers. You are expected to get the papers from
IEEE Xplore or the ACM Digital Library.
Course Structure:
Class project: Each team of 2 or 3 students will work on a separate research project.
Each project will focus on one aspect of
fault-tolerant system design and will test your ability to design, model or
implement, run experiments, and evaluate the results. The goal is to produce
work that can be submitted to a conference, which has happened with
many projects in the past. As graduate students, you will find that building
your publication record is in the best interest of your career.
The project has the following phases, each with a
tentative date.
Phase | Tentative date
List of suggested projects made available | September 5
Project teams formed, discussion of project ideas with instructor | September 5-12
Project proposals submitted | September 12
Interim project presentations (15 minutes each group) | October 12, 14
Preliminary project report | October 17
Final project presentations | Last two days of class
Final project report | Last day of semester
Exams: There will be a mid-term and a final exam. Each exam will be open book,
open notes, open computer. The mid-term exam will be
one hour long. The final exam will be comprehensive.
Homeworks: There will be three homeworks: two written and one
programming-based. The programming-based homework will introduce a widely used
system modeling tool called UltraSAN. You will use it to model a realistic system
and solve the model to determine the dependability characteristics of the
system. This will give you valuable exposure to evaluating a system
by modeling its relevant parts.
Active Learning Activity: We will have in-class activities where you
solve problems based on material covered in the previous week’s lectures. Some
of these will be individual and some will be group-based.
Submissions: All homework submissions will be done electronically
through Blackboard.
Dependability in the News: The class will
read articles about dependability issues in the news and will provide analysis
of these, including probable cause of the incidents and possible prevention or
remediation actions. This will highlight the connections between the
fundamental techniques we learn and their applications in the real world.
Grade Allocation:
Course project: 50%
Mid-term: 15%
Final: 20%
Homeworks: 15%
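As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function name, the dictionary keys, and the assumption that every component is scored on a 0-100 scale are illustrative only; this handout does not specify how component scores are normalized or how the weighted total maps to a letter grade.

```python
from typing import Mapping

# Weights in percent, taken from the grade allocation above.
WEIGHTS = {"project": 50, "midterm": 15, "final": 20, "homeworks": 15}


def weighted_course_score(scores: Mapping[str, float]) -> float:
    """Combine component scores into one weighted score.

    Assumption (not stated in the handout): each component score
    is already expressed on a 0-100 scale.
    """
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must add up to 100%")
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student: strong project, weaker exams.
example = {"project": 90, "midterm": 80, "final": 70, "homeworks": 100}
print(weighted_course_score(example))  # prints 86.0
```

Because the project carries half of the weight, a 10-point change in the project score moves the total by 5 points, more than the same change in any other component.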
For reference, the class performances in the last few offerings of this course were
as follows. Fall 2016: 4 A+, 12 A, 1 A-; Fall 2015: 3 A+, 18 A, 2 A-; Fall 2014:
3 A+, 16 A, 1 A-, 1 B-.
Lecture Outline
This is the tentative outline of coverage of topics in the class.
Topic | Number of lectures
Introduction: Motivation, System view of high availability design, Terminology | 2
Stochastic analysis of reliability: Discrete distributions, Continuous distributions | 6
Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Coding for detection and recovery. Application: SEC-DED codes | 3
Error detection and correction techniques: Watchdog processors, Heartbeats, Consistency and capability checking, Data audits, Assertions, Control-flow checking. Application: Erasure-coded storage | 3
Software fault tolerance: Process pairs, Robust data structures, N version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs. Application: Quantitative evaluation of NVP and RB | 5
Secure coding practices: Principles and practice. Application: Coding examples | 2
Network fault tolerance: Reliable communication protocols, Agreement protocols, Byzantine fault tolerance. Application: Bitcoin | 6
Modeling. Application: UltraSAN, Sharpe | 4
Checkpointing & Recovery. Application: SCR checkpointing system for DOE supercomputers | 4
Experimental Evaluation: Simulation and Fault-injection based | 2
Practical Systems for Fault Tolerance: Putting it all together. Application: Amazon Web Service. Application: Hadoop | 2
Industry presentations | 2
Discussion of projects | 2
Tests | 1
Total | 44
Course Policies
1. Academic honesty. The ECE faculty expect every
member of the Purdue community to practice honorable and ethical behavior both
inside and outside the classroom. Any actions that might unfairly improve a
student’s score on homework, quizzes, or examinations will be considered
cheating and will not be tolerated.
Examples of cheating include (but are not limited to):
· Sharing results or other information during an examination.
· Turning in someone else’s work (apart from project partner’s) as results on the project.
· Submitting homework that is not your own work or engaging in forbidden homework collaborations.
· Requesting a regrade of answers or work that has been altered.
Cheating on an assignment or examination will result
in a failing grade for the course. All occurrences of academic dishonesty will
be reported to the Assistant Dean of Students and copied to the ECE Associate
Head for Education. If there is any question as to whether a given action might
be construed as cheating, please see the instructor before you engage in any
such action.
2. Homework/Projects. Please submit your homeworks and projects by the due date and time. Failure to do so will result in a penalty of 10% of the grade on the assignment for each hour it is late. Beyond 5 hours, you will not get any credit for the assignment.
The assignments will be returned electronically within
a week with instructor comments.
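To make the late penalty in policy 2 concrete, here is a minimal Python sketch. Two details are assumptions and not stated in the policy: a partially elapsed hour is charged as a full hour, and the 10% is taken from the score the submission would otherwise have earned. The function name is hypothetical.

```python
import math

PENALTY_PERCENT_PER_HOUR = 10  # from the policy: 10% per hour late
CUTOFF_HOURS = 5               # from the policy: no credit beyond 5 hours


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Return the score after applying the late penalty.

    Assumptions (not stated in the policy): any started hour counts
    as a full hour, and the penalty is a percentage of raw_score.
    """
    if hours_late <= 0:
        return raw_score  # on time: no penalty
    if hours_late > CUTOFF_HOURS:
        return 0.0  # beyond the cutoff: no credit
    hours_charged = math.ceil(hours_late)
    remaining_percent = 100 - PENALTY_PERCENT_PER_HOUR * hours_charged
    return raw_score * remaining_percent / 100


print(late_adjusted_score(80, 0))    # on time
print(late_adjusted_score(80, 2))    # two hours late: 20% penalty
print(late_adjusted_score(80, 5.5))  # past the cutoff
```

Under these assumptions, a submission that would have earned 80 points earns 64 if it is two hours late and nothing if it is more than five hours late. If you are unsure how a partial hour is counted, ask the instructor before the deadline.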
3. Regrade Requests. Exams and homeworks may be submitted for
regrading up to one week after they are returned to the class. To request a
regrade, write an explanation of your request on a separate sheet of paper and
attach it to the homework or the exam, then give it to the professor. A regrade
request may increase or decrease your grade.
4. Feedback. I actively solicit positive and negative feedback
throughout the course. If you have a complaint about how the course is taught or
organized, constructive feedback on what would work better for you, or topics
that you would want to see covered in the course, please send e-mail feedback
to sbagchi@purdue.edu any time during the semester or afterwards. Feedback will
in no way negatively influence your grade; thoughtful feedback, both positive and
negative, is much appreciated. Anonymous feedback can also be given through a form
that is accessible through the course web page. The form does not track any
personally identifiable information. In addition, an interim written evaluation
will be collected so that mid-stream adjustments can be made to the class.
5. Extraordinary Events. In the event of a major
campus emergency, course requirements, deadlines and grading percentages are
subject to changes that may be necessitated by a revised semester calendar or
other circumstances. In such an event, information will be provided through
Blackboard Learn.