Course Information Handout
ECE 60872: Fault-Tolerant Computer System Design
More accurately: Dependable and Secure Computing
Spring 2024
The course introduces the hardware and software
methodologies for specifying, modeling and designing fault-tolerant systems,
supported by case studies of real systems. The material presents a broad
spectrum of hardware and software error detection and recovery techniques that can
be used to build reliable networked systems.
The lectures discuss how the hardware and software techniques interplay,
what techniques can be provided in COTS hardware, what can be embedded into
operating system and network communication layers, and what can be provided via
a distributed software layer and in the application itself. The course also covers the
emerging role of data analytics in building and operating reliable systems.
The course focuses on hands-on learning through the design and
development of innovative systems in the course project, which accounts for 50% of
the course grade.
This is reinforced by two lectures given by
practitioners from the industry who share their experiences and insights in
building dependable systems.
Note: This is not an
advanced graduate level course. Any student with a strong undergraduate CS or
ECE background, i.e., one who is able to program in at least one high-level
programming language and has a basic knowledge of probability, can take the
class.
We will devote multiple
lectures to the emerging topic of big data for reliability and security. This
will cover some of the fundamental algorithms and 2 large use cases with
real-world data.
Class hours: Monday, Wednesday, and Friday 3:30-4:20 pm, EE 236
Instructor: Prof. Saurabh Bagchi,
Professor, School of Electrical and Computer Engineering (ECE) and Department
of Computer Science (CS). In addition, there will be 2 guest lectures by
practitioners from the industry.
Office, Phone, Email: EE 325, 765-494-1741 (Office), sbagchi@purdue.edu
Office hours: Tuesday 10-11 am and Friday 4:30-5:30 pm
Administrative Assistant: Mary-Ann Satterfield, msaterfi@purdue.edu, EE 326B, 494-6389
Graduate Course Assistant: Preeti Mukherjee (mukher57@purdue.edu)
The
TA is available to help with conceptual questions on the topics covered in the
class plus programming questions on the programming assignments and projects.
However, they are not available to hand hold you with your coding problems (you
wish!).
URL: https://engineering.purdue.edu/ftc/
Piazza: Sign up through https://piazza.com/purdue/spring2024/ece60872
Textbook: No textbook.
Reference Books: (No need to buy since only parts of each will be used and I will provide
photocopies of relevant portions.)
1. William Stallings, Computer Security: Principles and Practice, 4th edition, Pearson.
2. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems - Design and Evaluation, 3rd edition, 1998, A.K. Peters, Limited.
3. D. K. Pradhan, ed., Fault Tolerant Computer System Design, 1st edition, 1996, Prentice-Hall.
4. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.
Apart from these, the course will use technical
conference and journal papers. You are expected to get the papers from IEEE Xplore or the ACM Digital Library.
Grade Allocation:
Course project: 50%
Mid-term: 15%
Final: 20%
Homeworks: 15%
For reference, in the last 3 offerings of the class, 90% of the students received an A+
or A.
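As a rough illustration of how the allocation above combines into a single course score, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and that the components are combined as a plain weighted sum; the handout does not state this explicitly, and the names `WEIGHTS` and `course_score` are made up for the example.

```python
# Weights copied from the Grade Allocation list above.
# The dictionary keys are illustrative names, not official ones.
WEIGHTS = {
    "project": 0.50,
    "midterm": 0.15,
    "final": 0.20,
    "homeworks": 0.15,
}


def course_score(scores):
    """Return the weighted course score (0-100).

    Assumption: each entry in `scores` is on a 0-100 scale and the
    components are combined as a simple weighted sum.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"project": 90, "midterm": 80, "final": 70, "homeworks": 100}
    print(course_score(example))
```

Because the project weight is half of the total, a 10-point change in the project score moves the course score by 5 points, more than the same change in any other single component.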
Course Structure:
Class project: There will be different research projects that each team of 2 or 3 students
will work on. Each project will focus on one aspect
of fault-tolerant system design and will test the ability to design, model or
implement, execute experiments and perform evaluation. The target will be to produce
work that can be sent for a conference publication, which has happened with
many projects in the past. As graduate students, you will find it in the best interest
of your career to build your publication record.
There will be the following phases in the project, each with its
tentative timeline.
Phase | Tentative date
List of suggested projects made available | February 5
Project teams formed, discussion of project ideas with instructor | February 6-10
Project proposals submitted | February 12
Interim project presentations | March 18, 20
Preliminary project report | March 25
Final project presentations | Last two days of class
Final project report | Last day of semester
Exams: There will be one mid-term and one final exam. Each exam will be open
book, open notes, open computer. The mid-term exam
will be a 1-hour exam. The final exam will be
comprehensive.
The mid-term exam will be on March 6 (Wed).
Homeworks: There will be
three homeworks – two written and one
programming-based. The programming-based homework will introduce a widely-used
system modeling tool called UltraSAN. You will use it
to model a realistic system and solve the model to determine the dependability
characteristics of the system. This will give you valuable exposure to how you
can evaluate a system by modeling its relevant parts.
Active Learning Activity: We will have in-class activities where you
solve problems based on material covered in the previous week’s lectures. Some
of these will be individual and some will be group-based.
Submissions: All homework submissions will be done electronically
through Brightspace.
Dependability in the News: The class will
read articles about dependability issues in the news and will provide analysis
of these, including probable cause of the incidents and possible prevention or
remediation actions. This will highlight the connections between the
fundamental techniques we learn and their applications in the real world.
Lecture Outline
This
is the tentative outline of coverage of topics in the class.
Topic | Lectures
Introduction: Motivation, System view of high availability design, Terminology | 2
Stochastic analysis of reliability · Discrete distributions · Continuous distributions | 6
Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Coding for detection and recovery · Application: SEC-DED codes | 3
Software fault tolerance: Process pairs, Robust data structures, N version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs · Application: Quantitative evaluation of NVP and RB | 3
Secure coding practices: Principles and practice · Application: Coding examples | 2
Network fault tolerance: Reliable communication protocols, Agreement protocols, Byzantine fault tolerance · Application: Bitcoin | 5
Big data for reliability · Application: Failure analysis of Purdue compute clusters | 4
Big data for security · Application: ML analysis of ransomware | 3
Modeling · Application: UltraSAN, Sharpe | 2
Checkpointing & Recovery · Application: SCR checkpointing system for DOE supercomputers | 3
Experimental Evaluation: Simulation and Fault-injection based | 2
Practical Systems for Fault Tolerance: Putting it all together · Application: Amazon Web Service · Application: New York Stock Market | 2
Industry presentations | 2
Discussion of projects | 2
Project presentations | 2
Tests | 1
Total | 44
Course Policies
1. Academic honesty. The ECE faculty expect every
member of the Purdue community to practice honorable and ethical behavior both
inside and outside the classroom. Any actions that might unfairly improve a
student’s score on homework, quizzes, or examinations will be considered
cheating and will not be tolerated.
Examples of cheating include (but are not limited
to):
· Sharing results or other information during an examination.
· Turning in someone else’s work (apart from project partner’s) as
results on the project.
· Submitting homework that is not your own work or engaging in forbidden
homework collaborations.
· Requesting a regrade of answers or work that has been altered.
Cheating on an assignment or examination will result
in a failing grade for the course. All occurrences of academic dishonesty will
be reported to the Assistant Dean of Students and copied to the ECE Associate
Head for Education. If there is any question as to whether a given action might
be construed as cheating, please see the instructor before you engage in any
such action.
2. Homework/Projects. Please submit your homeworks and projects by the due date and time. Failure to do so will result in a penalty of 10% of the grade on the assignment for each hour it is late. Beyond 5 hours, you will not get any credit for the assignment. The assignments will be returned electronically within a week with instructor comments.
3. Regrade Requests. Exams and homeworks
may be submitted for regrading up to one week after they are returned to the
class. To request a regrade, write an explanation of your request on a separate
sheet of paper and attach it to the homework or the exam, then give it to the
professor. A regrade request may increase or decrease your grade.
4. Feedback. I actively solicit positive and negative feedback
throughout the course, including anonymous feedback. If you have a complaint
about how the course is taught or organized, constructive feedback on what
would work better for you, or topics that you would want to see covered in the
course, please send e-mail feedback to sbagchi@purdue.edu any time during the
semester or afterwards. Feedback will in no way negatively influence your
grade—thoughtful feedback both positive and negative is much appreciated. Anonymous
feedback can also be given through a form that is accessible through the course
web page. All such feedback will be used to make any mid-stream adjustments in
the class.
5. Extraordinary Events. In the event of a major
campus emergency, course requirements, deadlines and grading percentages are
subject to changes that may be necessitated by a revised semester calendar or
other circumstances. In such an event, information will be provided through Brightspace.
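To make the late-submission rule in policy 2 above concrete, here is a minimal sketch of the arithmetic. Two points are assumptions, since the policy text does not settle them: a partial hour is charged as a full hour, and the 10% per hour is taken from the score the assignment earned. The function name `late_adjusted_score` is made up for the example.

```python
import math


def late_adjusted_score(score, hours_late):
    """Apply the late penalty from policy 2 to an earned score.

    From the handout: 10% of the assignment grade is deducted for each
    hour late, and no credit is given beyond 5 hours.
    Assumptions (not stated in the handout): any started hour counts as
    a full hour, and the deduction is a percentage of the earned score.
    """
    if hours_late <= 0:
        return float(score)  # on time: no penalty
    if hours_late > 5:
        return 0.0  # beyond 5 hours: no credit
    hours_charged = math.ceil(hours_late)
    return score * (1 - 0.10 * hours_charged)


if __name__ == "__main__":
    for hours in (0, 0.5, 2, 5, 5.5):
        print(hours, late_adjusted_score(80, hours))
```

Under these assumptions, an assignment that earned 80 points and was submitted 2 hours late would be recorded as 64 points, and the same assignment submitted more than 5 hours late would receive 0.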
Administrative information common to Purdue
Attendance Policy
This course follows Purdue’s academic regulations
regarding attendance, which state that students are expected to be present for
every meeting of the classes in which they are enrolled. Attendance will be
taken at the beginning of each class and lateness will be noted. When conflicts
or absences can be anticipated, such as for many University-sponsored
activities and religious observations, the student should inform the instructor
of the situation as far in advance as possible. For unanticipated or emergency
absences when advance notification to the instructor is not possible, the
student should contact the instructor as soon as possible by email or phone.
When the student is unable to make direct contact with the instructor and is
unable to leave word with the instructor’s department because of circumstances
beyond the student’s control, and in cases falling under excused absence
regulations, the student or the student’s representative should contact or go
to the Office of the
Dean of Students website to complete appropriate forms for instructor notification. Under
academic regulations, excused absences may be granted for cases of
grief/bereavement, military service, jury duty, and parenting leave. For
details, see the Academic
Regulations & Student Conduct section of the University Catalog website.
Guidance on class attendance related to COVID-19 is outlined in the Protect Purdue Pledge for Fall 2021 on the Protect Purdue
website.
Academic Guidance in Event of Quarantine/Isolation:
If you must
miss class at any point in time during the semester, please reach out to me via
Purdue email so that we can communicate about how you can maintain your
academic progress. If you find yourself too sick to progress in the course,
notify your adviser and notify me via email or Brightspace. We will make
arrangements based on your particular situation. Please note that, according to
Details for Students on Normal Operations for Fall
2021 announced on the Protect Purdue website,
“individuals who test positive for COVID-19 are not guaranteed remote access to
all course activities, materials, and assignments.”