August 31, 2015

Debugging applications on today's largest supercomputers

Prof. Saurabh Bagchi (right) with two graduate researchers in his group, Subrata Mitra (middle) and Suhas Javagal, in a Purdue ITaP data center, surrounded by a supercomputer they have used for some of their tool development and debugging.

Debugging large-scale programs that run on the largest supercomputers is significantly more difficult than debugging serial programs. Today's top supercomputers have more than a million cores, while human cognitive abilities are overwhelmed by more than a few concurrent events. When debugging a program running on such a machine, programmers must check the state of many parallel processes and reason about many different execution paths. Traditional debugging tools scale poorly with massive parallelism, because they must orchestrate the execution of a large number of processes and collect data from them efficiently. The push toward exascale computing has only increased the need for scalable debugging techniques.

Profs. Saurabh Bagchi and Milind Kulkarni and their team at Purdue have worked with the Department of Energy's Lawrence Livermore National Laboratory (LLNL) for several years to develop software methodologies and usable tools for running programs reliably on the largest supercomputers. An article summarizing the project's developments has just appeared in the September 2015 issue of Communications of the ACM. The project is ongoing, and the team now expects to extend its methodologies to heterogeneous computing systems, including a GPU-enabled cluster at LLNL.

Link to Purdue ITaP article

Link to the ACM article

Link to a video explaining the project: