Course Information Handout

ECE 60872: Fault-Tolerant Computer System Design

More accurately: Dependable and Secure Computing

 

Purdue University

Spring 2024

 

The course introduces hardware and software methodologies for specifying, modeling, and designing fault-tolerant systems, supported by case studies of real systems. The material presents a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how hardware and software techniques interact, which techniques can be provided in COTS (commercial off-the-shelf) hardware, which can be embedded in the operating system and network communication layers, and which can be provided via a distributed software layer or in the application itself. The course also covers the emerging role of data analytics in building and operating reliable systems.

The course focuses on hands-on learning through the design and development of innovative systems in the course project, which accounts for 50% of the course grade.

This is reinforced by two lectures given by practitioners from the industry who share their experiences and insights in building dependable systems.

 

Note: This is not an advanced graduate-level course. Any student with a strong undergraduate CS or ECE background, i.e., one who can program in at least one high-level programming language and has a basic knowledge of probability, can take the class.

We will devote multiple lectures to the emerging topic of big data for reliability and security. These lectures will cover some of the fundamental algorithms and two large use cases with real-world data.

 

 

 

Class hours: Monday, Wednesday, and Friday 3:30-4:20 pm, EE 236

 

Instructor: Prof. Saurabh Bagchi, Professor, School of Electrical and Computer Engineering (ECE) and Department of Computer Science (CS). In addition, there will be 2 guest lectures by practitioners from the industry.

 

Office, Phone, Email: EE 325, 765-494-1741 (Office), sbagchi@purdue.edu

Office hours: Tuesday 10-11 am and Friday 4:30-5:30 pm

 

Administrative Assistant: Mary-Ann Satterfield, msaterfi@purdue.edu, EE 326B, 494-6389

           

Graduate Course Assistant: Preeti Mukherjee (mukher57@purdue.edu)

The TA is available to help with conceptual questions on the topics covered in class and with programming questions on the programming assignments and projects. However, the TA is not available to hand-hold you through your coding problems (you wish!).

 

URL: https://engineering.purdue.edu/ftc/

 

Piazza: Sign up through https://piazza.com/purdue/spring2024/ece60872

 

Textbook: No textbook.

 

Reference Books:  (No need to buy since only parts of each will be used and I will provide photocopies of relevant portions.)

1. William Stallings, Computer Security: Principles and Practice, 4th edition, Pearson.

2. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems - Design and Evaluation, 3rd edition, 1998, A.K. Peters, Limited.

3. D. K. Pradhan, ed., Fault Tolerant Computer System Design, 1st edition, 1996, Prentice-Hall.

4. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.

Apart from these, the course will use technical conference and journal papers. You are expected to obtain the papers from IEEE Xplore or the ACM Digital Library.

 

Grade Allocation:

            Course project: 50%

            Mid-term: 15%

            Final: 20%

            Homeworks: 15%
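
As a rough illustration of how the weights above combine, the sketch below computes a weighted course score in Python. It assumes each component is scored on a 0-100 scale; the function name and the sample scores are hypothetical, and the mapping from the weighted score to a letter grade is not specified in this handout.

```python
# Weights taken from the grade allocation above.
WEIGHTS = {
    "project": 0.50,
    "midterm": 0.15,
    "final": 0.20,
    "homeworks": 0.15,
}


def course_score(scores: dict) -> float:
    """Return the weighted course score.

    Assumption: every component in `scores` is on a 0-100 scale.
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical component scores, for illustration only.
example = {"project": 90, "midterm": 80, "final": 70, "homeworks": 100}
print(round(course_score(example), 2))
```

The project alone contributes half of the total, so in this example it accounts for 45 of the weighted points.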

 

For reference, in the last 3 offerings, 90% of the class received an A+ or A.

 

Course Structure:

Class project: There will be different research projects, and each team of 2 or 3 students will work on one. Each project will focus on one aspect of fault-tolerant system design and will test the ability to design, model or implement, execute experiments, and perform evaluation. The target will be to produce work that can be submitted for conference publication, which has happened with many projects in the past. As graduate students, you will find that building your publication record is in the best interest of your career.

The project will proceed through the following phases; the dates are tentative.

List of suggested projects made available: February 5

Project teams formed, discussion of project ideas with instructor: February 6-10

Project proposals submitted: February 12

Interim project presentations: March 18, 20

Preliminary project report: March 25

Final project presentations: Last two days of class

Final project report: Last day of semester

Exams: There will be one mid-term and one final exam. Each exam will be open book, open notes, open computer. The mid-term exam will last 1 hour. The final exam will be comprehensive.

The mid-term exam will be on March 6 (Wed).

Homeworks: There will be three homeworks: two written and one programming-based. The programming-based homework will introduce a widely used system modeling tool called UltraSAN. You will use it to model a realistic system and solve the model to determine the dependability characteristics of the system. This will give you valuable exposure to evaluating a system by modeling its relevant parts.

Active Learning Activity: We will have in-class activities where you solve problems based on material covered in the previous week’s lectures. Some of these will be individual and some will be group-based.

Submissions: All homework submissions will be done electronically through Brightspace.

Dependability in the News: The class will read news articles about dependability issues and analyze them, including the probable cause of each incident and possible prevention or remediation actions. This will highlight the connections between the fundamental techniques we learn and their applications in the real world.

 

Lecture Outline

This is the tentative outline of topics covered in the class, with the number of lectures planned for each.

Introduction: Motivation, System view of high availability design, Terminology (2 lectures)

Stochastic analysis of reliability (6 lectures)

·         Discrete distributions

·         Continuous distributions

Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Coding for detection and recovery (3 lectures)

·         Application: SEC-DED codes

Software fault tolerance: Process pairs, Robust data structures, N version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs (3 lectures)

·         Application: Quantitative evaluation of NVP and RB

Secure coding practices: Principles and practice (2 lectures)

·         Application: Coding examples

Network fault tolerance: Reliable communication protocols, Agreement protocols, Byzantine fault tolerance (5 lectures)

·         Application: Bitcoin

Big data for reliability (4 lectures)

·         Application: Failure analysis of Purdue compute clusters

Big data for security (3 lectures)

·         Application: ML analysis of ransomware

Modeling (2 lectures)

·         Application: UltraSAN, Sharpe

Checkpointing & Recovery (3 lectures)

·         Application: SCR checkpointing system for DOE supercomputers

Experimental Evaluation: Simulation and Fault-injection based (2 lectures)

Practical Systems for Fault Tolerance: Putting it all together (2 lectures)

·         Application: Amazon Web Services

·         Application: New York Stock Market

Industry presentations (2 lectures)

Discussion of projects (2 lectures)

Project presentations (2 lectures)

Tests (1 lecture)

Total: 44 lectures

 

Course Policies

1. Academic honesty. The ECE faculty expect every member of the Purdue community to practice honorable and ethical behavior both inside and outside the classroom. Any actions that might unfairly improve a student’s score on homework, quizzes, or examinations will be considered cheating and will not be tolerated.

Examples of cheating include (but are not limited to):

·         Sharing results or other information during an examination.

·         Turning in someone else’s work (other than your project partner’s) as results on the project.

·         Submitting homework that is not your own work or engaging in forbidden homework collaborations.

·         Requesting a regrade of answers or work that has been altered.

Cheating on an assignment or examination will result in a failing grade for the course. All occurrences of academic dishonesty will be reported to the Assistant Dean of Students and copied to the ECE Associate Head for Education. If there is any question as to whether a given action might be construed as cheating, please see the instructor before you engage in any such action.

 

2. Homework/Projects. Please submit your homeworks and projects by the due date and time. Failure to do so will result in a penalty of 10% of the grade on the assignment for each hour it is late. Beyond 5 hours, you will not get any credit for the assignment. The assignments will be returned electronically within a week with instructor comments.
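
The late-penalty arithmetic above can be sketched as follows. This is only an illustration under stated assumptions, not an official grading script: it assumes a partial hour is charged as a full hour and that the 10% is taken from the score the assignment earned, neither of which the policy spells out. The function name is hypothetical.

```python
import math

PENALTY_PER_HOUR = 0.10  # 10% of the assignment grade per hour late
CUTOFF_HOURS = 5         # beyond 5 hours late, no credit


def late_adjusted_score(raw_score: float, hours_late: float) -> float:
    """Return the assignment score after applying the late penalty.

    Assumptions (not stated in the policy): a partial hour counts as a
    full hour, and the penalty is a fraction of the earned score.
    """
    if hours_late <= 0:
        return raw_score  # on time: no penalty
    if hours_late > CUTOFF_HOURS:
        return 0.0  # past the cutoff: no credit
    hours_charged = math.ceil(hours_late)
    return raw_score * (1 - PENALTY_PER_HOUR * hours_charged)


# Hypothetical example: an assignment that earned 80 points, submitted 2 hours late.
print(round(late_adjusted_score(80, 2), 2))
```

Under these assumptions, a submission exactly 5 hours late loses half its score, and anything later receives zero.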

 

3. Regrade Requests. Exams and homeworks may be submitted for regrading up to one week after they are returned to the class. To request a regrade, write an explanation of your request on a separate sheet of paper and attach it to the homework or the exam, then give it to the professor. A regrade request may increase or decrease your grade.

 

4. Feedback. I actively solicit positive and negative feedback throughout the course, including anonymous feedback. If you have a complaint about how the course is taught or organized, constructive feedback on what would work better for you, or topics that you would want to see covered in the course, please send e-mail feedback to sbagchi@purdue.edu any time during the semester or afterwards. Feedback will in no way negatively influence your grade—thoughtful feedback both positive and negative is much appreciated. Anonymous feedback can also be given through a form that is accessible through the course web page. All such feedback will be used to make any mid-stream adjustments in the class.

 

5. Extraordinary Events. In the event of a major campus emergency, course requirements, deadlines and grading percentages are subject to changes that may be necessitated by a revised semester calendar or other circumstances. In such an event, information will be provided through Brightspace.


 

Administrative information common to Purdue

 

Attendance Policy

 

This course follows Purdue’s academic regulations regarding attendance, which state that students are expected to be present for every meeting of the classes in which they are enrolled. Attendance will be taken at the beginning of each class and lateness will be noted. When conflicts or absences can be anticipated, such as for many University-sponsored activities and religious observances, the student should inform the instructor of the situation as far in advance as possible. For unanticipated or emergency absences when advance notification to the instructor is not possible, the student should contact the instructor as soon as possible by email or phone. When the student is unable to make direct contact with the instructor and is unable to leave word with the instructor’s department because of circumstances beyond the student’s control, and in cases falling under excused absence regulations, the student or the student’s representative should contact or go to the Office of the Dean of Students website to complete appropriate forms for instructor notification. Under academic regulations, excused absences may be granted for cases of grief/bereavement, military service, jury duty, and parenting leave. For details, see the Academic Regulations & Student Conduct section of the University Catalog website.

Guidance on class attendance related to COVID-19 is outlined in the Protect Purdue Pledge for Fall 2021 on the Protect Purdue website.

 

Academic Guidance in Event of Quarantine/Isolation:

 

If you must miss class at any point in time during the semester, please reach out to me via Purdue email so that we can communicate about how you can maintain your academic progress. If you find yourself too sick to progress in the course, notify your adviser and notify me via email or Brightspace. We will make arrangements based on your particular situation. Please note that, according to Details for Students on Normal Operations for Fall 2021 announced on the Protect Purdue website, “individuals who test positive for COVID-19 are not guaranteed remote access to all course activities, materials, and assignments.”