Announcement of New Graduate Research Assistantship Position in the Dependable Computing Systems Lab (DCSL)

Topic: Why do large-scale computing clusters fail? How can we reduce these failure incidences?



Saurabh Bagchi

School of Electrical and Computer Engineering

Department of Computer Science (Courtesy Appointment)

Purdue University

Contact: 765-494-3362; sbagchi@purdue.edu


Posted: December 27, 2016

DCSL is looking for a new graduate research assistant to work on an NSF-sponsored project on data analysis for the reliability of large-scale computing clusters, starting in Spring 2017. The project runs for 3 years and is open to a PhD student in ECE or CS in the first or second year of his/her studies. The interview process will begin immediately and will conclude once a qualified applicant is identified.

Problem Statement

While dependability has become a critical property of the computing systems around us, there is a dearth of publicly available computer system usage and failure data. Today there is no open data repository for a recent computing infrastructure that is large enough, diverse enough, and that provides enough information about the infrastructure and the applications that run on it. We have an ongoing NSF-supported project to address this long-felt need by building an open data repository containing system configuration, usage, and failure data from large computing clusters at two institutions, Purdue and UIUC. Our first results on this project can be found in the following paper and the related web resource.

Subrata Mitra, Suhas Raveesh Javagal, Amiya K. Maji (ITaP), Todd Gamblin (LLNL), Adam Moody (LLNL), Stephen Harrell (ITaP), and Saurabh Bagchi, “A Study of Failures in Community Clusters: The Case of Conte,” In the 7th IEEE International Workshop on Program Debugging (co-located with ISSRE), pp. 1-8, October 23-27, 2016, Ottawa, Canada.

Saurabh Bagchi, Suhas Raveesh Javagal, Subrata Mitra, Stephen Harrell, and Charles Schwarz, “FRESCO: The Open Data Repository for Workloads and Failures in Large-scale Computing Clusters.” URL: http://www.purdue.edu/fresco

Participants

Purdue, including ITaP; UIUC, including NCSA; Los Alamos National Lab (LANL); Lawrence Livermore National Lab (LLNL)

Project Tasks

Our project will create a usable data repository and tools to analyze why components in these clusters fail and what can be done to reduce these incidents. This will involve analyzing usage and failure data and applying machine learning and computer systems techniques to perform root cause analysis, whose results can then be verified in collaboration with ITaP and NCSA personnel.
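To give a flavor of the kind of analysis involved, below is a minimal sketch in Python (using pandas) of how one might aggregate a cluster job-failure log to see which failure categories and which nodes dominate. The file name and column names (job_failures.csv, node, exit_reason, wallclock_hours) are illustrative assumptions only, not the actual schema of the FRESCO repository.

import pandas as pd

# Hypothetical extract: one row per failed job, with the node it ran on,
# a coarse failure category (e.g., "OOM", "node_crash", "timeout"),
# and the wall-clock hours lost when the job failed.
df = pd.read_csv("job_failures.csv")  # columns: node, exit_reason, wallclock_hours

# Count failures per category to see which failure modes dominate.
by_reason = df.groupby("exit_reason").size().sort_values(ascending=False)
print("Failures by category:")
print(by_reason.head(10))

# Rank nodes by total lost compute hours; nodes that recur near the top are
# candidates for deeper root cause analysis (bad memory, overheating, etc.).
by_node = df.groupby("node")["wallclock_hours"].sum().sort_values(ascending=False)
print("Top nodes by lost compute hours:")
print(by_node.head(10))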

Qualifications

Required: Ability to process and analyze large volumes of data; knowledge of C and a scripting language; familiarity with basic machine learning (ML) tools.

Desirable: System administration experience.

Application Procedure

Send an email to Prof. Saurabh Bagchi with your CV (as a PDF attachment) and answers to the following specific questions in the body of the email. Qualified candidates will be invited for interviews.

1. When did you start your Master's/PhD?

2. What are your grades in courses at Purdue?

3. What is your undergraduate school, and what was your standing in your department (e.g., 3rd among 50 students in Computer Science, top 5% of 60 students in ECE)?

4. What were your grades in programming courses during your undergraduate studies?

5. Is there a Purdue person (professor, supervisor, etc.) who can speak to your qualifications?