Announcement of New Graduate Research Assistantship Position in the Dependable Computing Systems Lab (DCSL)

Topic: Why do large-scale computing clusters fail? How can we reduce these failure incidences?


 

Saurabh Bagchi

School of Electrical and Computer Engineering

Department of Computer Science

Purdue University

Contact: 765-494-3362; sbagchi@purdue.edu

 

 

Posted: October 22, 2015

DCSL is looking for a new graduate research assistant to work on the following sponsored project, starting in Spring 2016. The project runs for 3 years and is open to PhD students in the 1st or 2nd year of their studies. The interview process starts now and will conclude by the end of November.

Application procedure

Send an email to Prof. Saurabh Bagchi with your CV (in PDF) and answers to the following specific questions in the body of the email. Qualified candidates will be invited for interviews.

1. When did you start your Masters/PhD?

2. What are your grades in courses at Purdue?

3. What is your undergraduate school and what was your standing in your department (e.g., 3rd among 50 students in Computer Science, top 5% of 60 students in ECE)?

4. What are your grades in programming courses from your undergraduate studies?

5. Is there a Purdue person (professor, supervisor, etc.) who can speak about your qualifications?

Problem Statement

While dependability has become a critical property of the computing systems around us, there is a dearth of publicly available computer system usage and failure data. Today there is no open data repository for a recent computing infrastructure that is large enough, diverse enough, and accompanied by enough information about the infrastructure and the applications that run on it. We have recently started an NSF-supported project to address this long-felt need by building an open data repository containing system configuration, usage, and failure data from large computing clusters at two institutions, Purdue and UIUC.

Participants

Purdue, including ITaP; UIUC, including NCSA; Los Alamos National Lab (LANL); Lawrence Livermore National Lab (LLNL)

Project Tasks

Our project will create a usable data repository and tools to analyze why components in these clusters fail and what can be done to reduce these incidents.

1. Data analysis of system usage data. We are collecting a wealth of system usage data, for example from syslog. We are also collecting data about the jobs that run on these computing clusters. We want to perform statistical data analysis on the system usage and the job resource utilization data to identify usage patterns and sources of bottlenecks.

2. Data analysis of failure data. We want to perform statistical analysis to uncover the root causes of failures of machines, disks, network elements, etc. We want to correlate such failure events with system and job resource usage.

3. Mitigation action. We want to suggest possible mitigation actions to reduce the incidence of resource congestion or failures. For this project, we have the ability to deploy some of these mitigation actions in our production systems at both campuses and observe the effects.
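
To give candidates a feel for tasks 1 and 2, here is a minimal sketch in Python of the kind of analysis involved: counting error events per node from syslog-style lines and correlating those counts with a resource usage figure. Everything in it is an assumption made for illustration: the log lines, the node names, the peak memory numbers, and the helper names error_counts and pearson are hypothetical and do not come from the project's actual data or tools. Three data points are far too few for a meaningful statistic; the real analysis would run over the full repository.

```python
import math
import re
from collections import Counter

# Hypothetical syslog excerpt in the traditional
# "Mon DD HH:MM:SS host tag[pid]: message" layout.
SAMPLE_SYSLOG = """\
Oct 22 10:01:13 node01 kernel: EDAC MC0: 1 CE memory read error
Oct 22 10:02:40 node02 sshd[811]: Accepted publickey for alice
Oct 22 10:05:02 node01 kernel: EDAC MC0: 1 CE memory read error
Oct 22 10:09:57 node03 kernel: I/O error, dev sda, sector 2048
Oct 22 10:11:30 node01 kernel: Out of memory: Killed process 4242
"""

# Hypothetical peak memory utilization per node (fraction of RAM),
# standing in for job resource usage data.
PEAK_MEMORY = {"node01": 0.95, "node02": 0.40, "node03": 0.70}

# Groups: timestamp, host, tag, message.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+([^:\[]+)(?:\[\d+\])?:\s+(.*)$"
)
# Crude keyword filter for error-like messages (an assumption, not a taxonomy).
ERROR_RE = re.compile(r"error|out of memory", re.IGNORECASE)


def error_counts(text):
    """Count error-like messages per host in syslog-style text."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and ERROR_RE.search(match.group(4)):
            counts[match.group(2)] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


counts = error_counts(SAMPLE_SYSLOG)
nodes = sorted(PEAK_MEMORY)
usage = [PEAK_MEMORY[node] for node in nodes]
errors = [counts[node] for node in nodes]  # Counter yields 0 for quiet nodes
r = pearson(usage, errors)

for node in nodes:
    print(f"{node}: peak_mem={PEAK_MEMORY[node]:.2f} errors={counts[node]}")
print(f"Pearson correlation between peak memory and error count: {r:.2f}")
```

A correlation like this would only be a starting point: it flags where to look, and root cause analysis would still need to separate, say, failing memory modules from jobs that simply exhaust RAM.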

Qualifications

Required: Ability to process and analyze large volumes of data; knowledge of C and a scripting language; familiarity with basic machine learning (ML) tools.

Desirable: System administration experience.