Announcement of New Graduate Research Assistantship Position in the Dependable Computing Systems Lab (DCSL)
Topic: Why do large-scale computing clusters fail? How can we reduce the incidence of these failures?
Saurabh Bagchi
School of Electrical and Computer Engineering
Department of Computer Science
Contact: 765-494-3362; sbagchi@purdue.edu
Posted: October 22, 2015
DCSL is looking for a new graduate research assistant to work on the following sponsored project, starting in Spring 2016. The project runs for 3 years and is open to a PhD student in the 1st or 2nd year of study. The interview process starts now and will conclude by the end of November.
Application procedure
Send an email to Prof. Saurabh Bagchi with your CV (as a PDF) and answers to the following questions in the body of the email. Qualified candidates will be invited for interviews.
1. When did you start your Masters/PhD?
2. What are your grades in courses at Purdue?
3. What is your undergraduate school, and what was your standing in your department (e.g., 3rd among 50 students in Computer Science, top 5% of 60 students in ECE)?
4. What are your grades in the programming courses from your undergraduate studies?
5. Is there a Purdue person (professor, supervisor, etc.) who can speak about your qualifications?
Problem Statement
While dependability has become a critical property of the computing systems around us, there is a dearth of publicly available data on computer system usage and failures. No open data repository exists today for a recent computing infrastructure that is large enough, diverse enough, and detailed enough about the infrastructure and the applications that run on it. We have recently started an NSF-supported project to address this long-felt need by building an open data repository containing system configuration, usage, and failure data from large computing clusters at two institutions, Purdue and UIUC.
Participants
Purdue, including ITaP; UIUC, including NCSA; Los Alamos National Lab (LANL); Lawrence Livermore National Lab (LLNL)
Project Tasks
Our project will create a usable data repository and tools to analyze why components in these clusters fail and what can be done to reduce these incidents.
1. Data analysis of system usage data. We are collecting a wealth of system usage data, for example from syslog. We are also collecting data about the jobs that run on these computing clusters. We want to perform statistical analysis on the system usage and job resource utilization data to identify usage patterns and sources of bottlenecks.
2. Data analysis of failure data. We want to perform statistical analysis to uncover the root causes of failures of machines, disks, network elements, and other components. We want to correlate such failure events with system and job resource usage.
3. Mitigation action. We want to suggest possible mitigation actions to reduce the incidence of resource congestion or failures. For this project, we have the ability to deploy some of these mitigation actions in our production systems at both campuses and observe the effects.
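As a rough illustration of the kind of analysis described in task 2, the sketch below pairs each node's mean utilization with its failure count and computes a Pearson correlation between the two. The record layout, the node names, and the function names (pearson, failures_vs_usage) are assumptions made for this example only; they do not describe the project's actual data format or tools, and the numbers are made up.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def failures_vs_usage(usage_samples, failure_events):
    """Pair each node's mean utilization with its failure count.

    usage_samples:  iterable of (node, utilization), utilization in [0, 1]
                    (hypothetical layout, assumed for this sketch)
    failure_events: iterable of node names, one entry per recorded failure
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for node, util in usage_samples:
        totals[node] += util
        counts[node] += 1

    failures = defaultdict(int)
    for node in failure_events:
        failures[node] += 1

    # Only nodes with at least one usage sample are included.
    nodes = sorted(counts)
    mean_util = [totals[n] / counts[n] for n in nodes]
    fail_count = [failures.get(n, 0) for n in nodes]
    return nodes, mean_util, fail_count


# Made-up data, for illustration only.
usage = [("n1", 0.2), ("n1", 0.4), ("n2", 0.5),
         ("n2", 0.7), ("n3", 0.9), ("n3", 0.9)]
failure_log = ["n2", "n3", "n3", "n3"]

nodes, util, fails = failures_vs_usage(usage, failure_log)
r = pearson(util, fails)
print(nodes, fails, round(r, 3))
```

A high coefficient here would only flag a pattern worth investigating; establishing a root cause would need the deeper analysis and the production deployments described in the tasks above.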
Qualifications
Required: Ability to process and analyze large volumes of data; knowledge of C and a scripting language; familiarity with basic ML tools.
Desirable: System administration experience.