Supercomputing Conference 2015, 2016

Birds of a Feather Session on "Fresco: An Open Failure Data Repository for Dependability Research and Practice"

Organizers: Saurabh Bagchi, Carol Song (Purdue); Ravishankar Iyer, Zbigniew Kalbarczyk (U. of Illinois at Urbana-Champaign); Nathan DeBardeleben (Los Alamos National Lab)

Fresco: An Open Failure and System Usage Data Repository for Dependability Research and Practice

URL: https://www.purdue.edu/fresco

We organized a Birds of a Feather (BoF) session at the ACM/IEEE Supercomputing Conference on November 18, 2015 and proposed one for the Supercomputing Conference 2016. The one-hour session featured a vigorous and constructive discussion of the need for an open data repository of system usage and failure data. We presented our vision and current status, and solicited feedback on what would make the repository useful to consumers of data and what would make it easy for a variety of organizations to contribute data. Feedback was collected orally through discussion at the BoF and formally through a questionnaire handed out at the session.

This BoF unveiled a recently awarded NSF-supported effort (CNS-1513051, CNS-1513197) to build an open failure data repository that enables data-driven resiliency research for large-scale computing clusters from design, monitoring, and operational perspectives. To address the dearth of large publicly available datasets, we have started this 3-year project to create a repository of system configuration, usage, and failure information from large computing systems. We have seeded the effort with data collected from a large Purdue computing cluster over a six-month period. As part of this BoF, we collected requirements for a larger, multi-institution repository and demonstrated the usage and data analytics tools for the current repository. We showed how simple analytic tools can be run on the dataset without downloading the data, and how the results of those tools can feed simple visualization tools.

Results: We collected the questionnaire responses, contacted those who had volunteered for further discussion (about 10 researchers and practitioners), and incorporated the requirements into our redesigned data repository. The requirements identified at the BoF also drove the development of some of our initial analytic tools.

Materials: Here are the primary materials related to this BoF. All are in PDF format.

  1. Presentation material from Saurabh
  2. Presentation material from Carol
  3. Presentation material from Nathan
  4. Questionnaire for collecting feedback from the attendees
  5. Responses to the questionnaire (the responders' names have been redacted): [ Batch 1 ] [ Batch 2 ]
  6. Flyer for announcing the BoF at the venue
  7. Proposal for Supercomputing 2016

Last updated: August 2016