Course Information Handout
ECE 695 – Fault-Tolerant Computer System Design
School of Electrical and Computer Engineering
Purdue University
Fall 2014

The course provides an introduction to the hardware and software methodologies for specifying, modeling, and designing fault-tolerant systems, supported by case studies of real systems. The material presents a broad spectrum of hardware and software error detection and recovery techniques that can be used to build reliable networked systems. The lectures discuss how the hardware and software techniques interact, what techniques can be provided in COTS hardware, what can be embedded into the operating system and network communication layers, and what can be provided via a distributed software layer and in the application itself.
The course focuses on hands-on learning through the design and development of systems in the course project, which accounts for 50% of the course grade.
We will use a modeling tool called UltraSAN to model a realistic system and then learn how the model can be solved to evaluate various dependability properties of the system.

CRN: 64769 (to be used by ECE students for registering for the class); 65126 (to be used by CS students for registering for the class)

Class hours: Monday, Wednesday, Friday 9:30-10:20, EE 224

Instructor: Prof. Saurabh Bagchi, Professor, School of Electrical and Computer Engineering. In addition, there will be 2 guest lectures by leading practitioners from the industry and 2 guest lectures by Prof. Patrick Eugster, Purdue CS.

Office, Phone, Email: EE 329, 49-43362, sbagchi@purdue.edu
Office hours: Tuesday and Friday 3-4

Administrative Assistant: Wanitta Thompson, thompsow@purdue.edu, EE 326B, 49-46389
           
Graduate Course Assistants: Nathan Burow (nburow@purdue.edu; 494-9462; EE 338) and Kanak Mahadik (kmahadik@purdue.edu; 494-3365; EE 34)
They are available to help with programming questions on the programming assignments and projects. However, they are not available to code for you (you wish!).

URL: https://engineering.purdue.edu/ee695b/

Textbook: No textbook.
Reference Books:  (No need to buy since only parts of each will be used and I will provide photocopies of relevant portions.)

Apart from these, the course will use technical conference and journal papers. You are expected to get the papers from IEEE Xplore or the ACM Digital Library.

Course Structure:
Class project: Each team of 2 or 3 students will work on a separate research project. Each project will focus on one aspect of fault-tolerant system design and will test the ability to design, model or implement, execute experiments, and perform evaluation. The target will be to produce work that can be submitted for conference publication, which has happened with many projects in the past. As graduate students, you will find that building your publication record is in the best interest of your career.
The project will have the following phases, each with a tentative date.


List of suggested projects made available: September 10
Project teams formed, discussion of project ideas with instructor: September 10-21
Project proposals submitted: September 24
Preliminary project report: October 23
Interim project presentations (15 minutes each group): October 23, 25
Final project presentations: Last two days of class
Final project report: Last day of exam week (Dec 14)

Exams: There will be a mid-term and a final exam. Each exam will be in-class and open book. The mid-term exam will be one hour long. The final exam will be comprehensive.
Homeworks: There will be three homeworks – two written and one programming-based. The programming-based homework will introduce a widely used system modeling tool called UltraSAN. You will use it to model a system and solve the model to determine the dependability characteristics of the system. This will give you valuable exposure to how you can evaluate a system by modeling its relevant parts.
Active Learning Activity: We will have in-class activities where you solve problems based on material covered in the previous week’s lectures. The class will be divided into two groups, and you will work on the problems within your group. We will see which group comes out on top after each activity!
Submissions: All homework submissions will be done electronically through Blackboard.
Presentations: There will be two presentations given by industry leaders who build fault-tolerant computer systems.  
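
As a small taste of what "solving a model to determine dependability characteristics" means in the modeling homework described above, the sketch below computes the steady-state availability of the simplest possible model: a single component that alternates between an up state and a down state with constant failure and repair rates. This is not UltraSAN and does not use its input format; the function name and the MTTF/MTTR numbers are hypothetical and chosen only for illustration.

```python
def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time a two-state (up/down) component is up.

    Assumes exponentially distributed times to failure and to repair,
    i.e. a two-state continuous-time Markov chain. Rates are per hour.
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    # Balance equation: failure_rate * P(up) = repair_rate * P(down),
    # together with P(up) + P(down) = 1.
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical numbers, not taken from any real system.
mttf_hours = 1000.0  # mean time to failure
mttr_hours = 10.0    # mean time to repair

availability = steady_state_availability(1.0 / mttf_hours, 1.0 / mttr_hours)
downtime_hours_per_year = (1.0 - availability) * 8760  # 8760 hours in a non-leap year

print(f"availability = {availability:.6f}")
print(f"expected downtime = {downtime_hours_per_year:.1f} hours/year")
```

The closed form is equivalent to MTTF / (MTTF + MTTR). Realistic models have many more states and are solved numerically or by simulation, which is where a modeling tool becomes useful.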

Grade Allocation:
            Course project: 50%
            Mid-term: 15%
            Final: 20%
            Homeworks: 15%
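
To see how the weights above combine, here is a minimal sketch that computes a weighted numeric score. It assumes each component is scored on a 0-100 scale; the component keys and the example scores are hypothetical, and the handout does not specify how numeric scores map to letter grades.

```python
# Weights taken from the grade allocation above.
WEIGHTS = {
    "project": 0.50,
    "midterm": 0.15,
    "final": 0.20,
    "homeworks": 0.15,
}


def course_score(scores):
    """Weighted course score, assuming every component is on a 0-100 scale."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example scores.
example = {"project": 90.0, "midterm": 80.0, "final": 70.0, "homeworks": 100.0}
print(f"weighted score = {course_score(example):.1f}")
```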

For reference, the grade distributions in the last four offerings of this course were: Fall 2012 – 2 A+, 7 A, 1 A-; Spring 2011 – 3 A+, 11 A, 1 A-; Spring 2009 – 2 A+, 8 A, 1 A-; Spring 2007 – 11 A, 1 B.

Lecture Outline
This is the tentative outline of coverage of topics in the class.


Introduction: Motivation, System view of high availability design, Two commercial examples (Stratus and Chameleon) – 2 lectures

Probability review, distributions – 4 lectures

Hardware redundancy: Basic approaches, Static & Dynamic, Voting, Fault tolerant interconnection networks – 3 lectures
·        Application: FTMP

Error detection techniques: Watchdog processors, Heartbeats, Consistency and capability checking, Data audits, Assertions, Control-flow checking – 4 lectures
·        Application: DHCP

Software fault tolerance: Process pairs, Robust data structures, N version programming, Recovery blocks, Replica consistency & reintegration, Multithreaded programs – 5 lectures
·        Application: VAX

Network fault tolerance: Reliable communication protocols, Agreement protocols, Database commit protocols – 5 lectures
·        Application: Distributed SQL server

Practical steps in design of high availability networked systems – 3 lectures
·        Application: Web services, Highly available clusters

Experimental Evaluation: Modeling – 4 lectures
·        Application: UltraSAN, Sharpe

Checkpointing & Recovery – 4 lectures
·        Application: Microcheckpointing

Experimental Evaluation: Simulation and Fault-injection based – 3 lectures

Practical Systems for Fault Tolerance: Putting it all together – 2 lectures
·        Application: Google high availability file system
·        Application: NASA Remote Exploration & Experimentation System

Industry presentations – 2 lectures

Discussion of projects – 1 lecture

Presentation of projects – 2 lectures

Tests – 1 lecture

Course Policies
1. Academic honesty. The ECE faculty expect every member of the Purdue community to practice honorable and ethical behavior both inside and outside the classroom. Any actions that might unfairly improve a student’s score on homework, quizzes, or examinations will be considered cheating and will not be tolerated.
Examples of cheating include (but are not limited to):

Cheating on an assignment or examination will result in a failing grade for the course. All occurrences of academic dishonesty will be reported to the Assistant Dean of Students and copied to the ECE Associate Head for Education. If there is any question as to whether a given action might be construed as cheating, please see the instructor before you engage in any such action.

2. Homework/Projects. Please submit your homeworks and projects by the due date and time. Failure to do so will result in a penalty of 10% of the grade on the assignment for each hour it is late. Beyond 5 hours, you will not get any credit for the assignment.

The assignments will be returned electronically within a week with instructor comments.
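
The late-submission rule in policy 2 can be sketched as a small function. The policy leaves two details open, so this sketch makes two assumptions that are not stated in the handout: any started hour counts as a full hour, and the 10% is taken from the score the submission would otherwise have earned. The function name is hypothetical.

```python
import math


def late_adjusted_score(earned: float, hours_late: float) -> float:
    """Score after the late penalty: 10% per hour late, no credit beyond 5 hours.

    Assumptions (not stated in the policy): partial hours are rounded up,
    and the penalty is a percentage of the earned score.
    """
    if hours_late <= 0:
        return earned  # on time: no penalty
    if hours_late > 5:
        return 0.0     # beyond 5 hours: no credit
    hours_charged = math.ceil(hours_late)
    return earned * (1.0 - 0.10 * hours_charged)


# Hypothetical submissions, each earning 80 points before any penalty.
for hours in (0, 0.5, 2, 5, 5.5):
    print(f"{hours} h late -> {late_adjusted_score(80.0, hours):.1f}")
```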

3. Regrade Requests. Exams and homeworks may be submitted for regrading up to one week after they are returned to the class. To request a regrade, write an explanation of your request on a separate sheet of paper and attach it to the homework or the exam, then give it to the professor. A regrade request may increase or decrease your grade.

4. Feedback. I actively solicit positive and negative feedback throughout the course. If you have a complaint about how the course is taught or organized, constructive feedback on what would work better for you, or topics that you would want to see covered in the course, please send e-mail feedback to sbagchi@purdue.edu any time during the semester or afterwards. Feedback will in no way negatively influence your grade; thoughtful feedback, both positive and negative, is much appreciated. Anonymous feedback can also be given through a form that will be set up and made accessible through the course web page. The form does not track any personally identifiable information. In addition, an interim student evaluation will be collected so that mid-stream adjustments can be made to the class.