Home Projects Publications Presentations People News Activities About DCSL Internal
 
Debugging Scalability Problems for Large Scale Systems
Summary

Developing correct and efficient software for large-scale systems is a challenging task. Developers may overlook corner cases that arise only in large-scale runs, employ inefficient algorithms that do not scale, optimize performance prematurely, or take the opposite approach and not optimize their code at all. Such program errors and inefficiencies can result in an especially subtle class of bugs that are scale-dependent: while small-scale test cases may not exhibit the bug, it surfaces in large-scale production runs and can change the result or performance of an application. The root of the problem is that most program testing is not done on production-scale systems, which leaves developers in the dark about the correctness and performance of their programs on large-scale production systems.

Classical debugging techniques based on program analysis do not help here because they rely on programmer-written rules to define expected behavior. To remedy this, researchers have leveraged statistical modeling techniques that compare program behavior between normal and buggy runs to find bugs without input from programmers. However, existing statistical debugging techniques do not take the scale of execution into consideration when building their models, and they require a large number of both correct and buggy runs on the production system to build effective models. This requirement severely restricts their use in debugging real-world production systems: end users are typically reluctant to share their data for fear of leaking private information and business secrets, while developers do not have the luxury of testing their programs repeatedly on production systems, which generates no revenue but wastes cycles and electricity.

In this work, we develop a series of statistical debugging techniques that detect and localize bugs without this burdensome data collection requirement, based on a key observation: most programs developed for large-scale systems exhibit behavioral features that are predictable from the scale of the run. Our first technique, Vrisha, detects bugs in large-scale programs by building models of behavior from a series of small-scale, bug-free runs. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined program features, whose values depend predictably on the scale of execution. Second, we extend Vrisha with two bug localization techniques based on feature reconstruction, Abhranta and WuKong, which can pinpoint bugs to individual program features. Abhranta adapts Vrisha's model to support bug localization at the cost of some bug detection accuracy. WuKong is a clean-slate approach that applies log-transformed linear regression to extrapolate the values of program features in large-scale runs from training data collected in small-scale runs. We have applied these techniques to detect and localize a variety of real-world scale-dependent bugs found in a popular MPI library, a DHT-based file-sharing application, and synthetic faults injected into benchmarks used for performance evaluation of supercomputers, and have shown that our techniques can be implemented with reasonably low overhead and high prediction accuracy.
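The extrapolation idea behind WuKong can be illustrated with a minimal sketch. This is not the project's implementation: it assumes a single program feature whose value grows polynomially with scale (so a linear fit in log-log space extrapolates cleanly), and the feature, training scales, observed value, and deviation threshold below are all hypothetical.

```python
# Minimal sketch of WuKong-style extrapolation (hypothetical feature and data):
# fit log(value) = a * log(scale) + b on small-scale runs, then predict the
# feature's value at production scale and flag large deviations.
import numpy as np

def fit_feature_model(scales, values):
    """Log-transformed linear regression over small-scale training runs."""
    a, b = np.polyfit(np.log(scales), np.log(values), 1)
    return a, b

def predict_feature(model, scale):
    """Extrapolate the expected feature value at a given scale."""
    a, b = model
    return float(np.exp(a * np.log(scale) + b))

# Hypothetical training data: a feature (e.g., a loop trip count) observed
# at 4, 8, 16, and 32 processes; here it scales linearly with process count.
train_scales = [4, 8, 16, 32]
train_values = [120, 240, 480, 960]

model = fit_feature_model(train_scales, train_values)
expected = predict_feature(model, 1024)   # predicted value at 1024 processes

observed = 58000.0                        # value seen in a large-scale run
rel_error = abs(observed - expected) / expected
if rel_error > 0.5:                       # hypothetical deviation threshold
    print(f"feature flagged: expected {expected:.0f}, observed {observed:.0f}")
```

A per-feature model like this needs only small-scale, bug-free training runs; features whose large-scale values deviate sharply from their predictions become candidates for bug localization.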

 

 

Achieved Technical Goals
Publications
Future Work
Students
Code & Data
Funding Source
 
 
465 Northwestern Avenue, West Lafayette, IN 47907   |  dcsl@ecn.purdue.edu   |  +1 765 494 3510


Last Update: January 25, 2012 14:50 by GMHoward