Debugging applications on today's largest supercomputers
Debugging large-scale programs that run on the largest supercomputers is significantly more difficult than debugging serial programs. Today's top supercomputers have more than a million cores, while human cognitive abilities are overwhelmed by more than a few concurrent events. When debugging a program running on such machines, programmers must check the state of many parallel processes and reason about many different execution paths. Traditional debugging tools scale poorly with massive parallelism, because they must orchestrate the execution of a large number of processes and collect data from all of them efficiently. The push toward exascale computing has only increased the need for scalable debugging techniques.
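One common idea behind scalable debugging tools is to avoid inspecting a million processes individually: instead, collapse them into a few equivalence classes by a behavioral signature (such as their call stacks) and examine one representative per class. The sketch below illustrates that idea in miniature; the function name, the stack tuples, and the rank data are all illustrative assumptions, not any specific tool's API.

```python
from collections import defaultdict

def group_by_signature(process_states):
    """Group process ranks by an equivalence signature (here, a call-stack
    tuple) so a debugger need only examine one representative per group.
    `process_states` is a hypothetical mapping: rank -> tuple of frames."""
    groups = defaultdict(list)
    for rank, stack in process_states.items():
        groups[stack].append(rank)
    return dict(groups)

# Illustrative input: six ranks, but only two distinct behaviors to inspect.
states = {
    0: ("main", "solver", "mpi_waitall"),
    1: ("main", "solver", "mpi_waitall"),
    2: ("main", "solver", "mpi_waitall"),
    3: ("main", "io_flush"),
    4: ("main", "solver", "mpi_waitall"),
    5: ("main", "io_flush"),
}
groups = group_by_signature(states)
# Two equivalence classes instead of six individual processes, so the
# debugging effort scales with the number of distinct behaviors, not ranks.
```

In a real tool the signatures would be gathered and merged across the machine (for example, over a reduction tree) rather than in a single dictionary, which is what keeps the approach scalable.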
Profs. Saurabh Bagchi and Milind Kulkarni and their team at Purdue have worked with the Department of Energy's Lawrence Livermore National Laboratory (LLNL) for several years to develop software methodologies and practical tools for running programs reliably on the largest supercomputers. An article summarizing the project's developments has just appeared in the September 2015 issue of Communications of the ACM. The project is ongoing, and the team now expects to extend their methodologies to heterogeneous computing systems, including a GPU-enabled cluster at LLNL.
Link to Purdue ITaP article
Link to the ACM article
Link to a video explaining the project