Fault Tolerance for Distributed Applications
-
" Reliable and Efficient Distributed Checkpointing System for Grid Environments," Tanzima Zerin Islam, Saurabh Bagchi, and Rudolf Eigenmann. Journal of Grid Computing, December 2014, Volume 12, Issue 4, pp 593-613
[ Abstract ]
In Fine-Grained Cycle Sharing (FGCS)
systems, machine owners voluntarily share their
unused CPU cycles with guest jobs, as long as
their performance degradation is tolerable. However,
unpredictable evictions of guest jobs lead to fluctuating
completion times. Checkpoint-recovery is an
attractive mechanism for recovering from such “failures”.
Today’s FGCS systems often use expensive,
high-performance dedicated checkpoint servers. However,
in geographically distributed clusters, this may
incur high checkpoint transfer latencies. In this paper
we present a distributed checkpointing system called
FALCON that uses available disk resources of the
FGCS machines as shared checkpoint repositories.
However, an unavailable storage host may lead to loss
of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm
for choosing reliable checkpoint repositories.
We experiment with FALCON in the university-wide
Condor testbed at Purdue and show improved and consistent
performance for guest jobs in the presence of
irregular resource availability.
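To make the repository-selection idea concrete, here is a minimal sketch: rank candidate storage hosts by their predicted availability over the next checkpoint interval and keep the top few. The exponential availability model, host names, and numbers are illustrative assumptions, not the paper's actual learned failure model.

```python
# Hypothetical sketch of FALCON-style checkpoint-repository selection.
import math

def availability_score(mtbf_hours, interval_hours):
    """P(host stays up through the interval), assuming exponential failures."""
    return math.exp(-interval_hours / mtbf_hours)

def choose_repositories(hosts, interval_hours, k=3):
    """hosts: dict host_name -> observed mean time between failures (hours)."""
    ranked = sorted(hosts, key=lambda h: availability_score(hosts[h], interval_hours),
                    reverse=True)
    return ranked[:k]

hosts = {"hostA": 120.0, "hostB": 36.0, "hostC": 500.0, "hostD": 80.0}
print(choose_repositories(hosts, interval_hours=6.0))  # ['hostC', 'hostA', 'hostD']
```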
-
" Addressing failures in exascale computing," Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak et al. International Journal of High Performance Computing Applications (2014): doi:10.1177/1094342014522573, pp. 1-45
[ Abstract ]
We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City,
Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all
the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software
layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software
perspective and focusing on a combined approach.
The workshop brought together participants with expertise in applications, system software, and hardware; they came
from industry, government, and academia, and their interests ranged from theory to implementation. The combination
allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
-
" Mitigating Interference in Cloud Services by Middleware Reconfiguration,"
Amiya Maji, Subrata Mitra, Bowen Zhou, Saurabh Bagchi and Akshat Verma (IBM Research). Accepted to appear at the 15th ACM/IFIP/USENIX Middleware conference, pp. 1-12, Nov 16-21, 2014. (Acceptance rate: 27/144 = 18.8%)
[ Presentation ] [ Abstract ]
Application performance has been and remains one of the top five concerns since the inception of cloud computing. A primary determinant of application performance is multi-tenancy, or sharing of hardware resources, in clouds. While some hardware resources can be partitioned well among VMs (such as CPUs), many others cannot (such as memory bandwidth). In this paper, we focus on understanding the variability in application performance on a cloud and explore ways for an end customer to deal with it. Based on rigorous experiments using CloudSuite, a popular Web 2.0 benchmark, running on EC2, we found that interference-induced performance degradation is a reality. On a private cloud testbed, we also observed that interference impacts the choice of the best configuration values for applications and middleware. We posit that intelligent reconfiguration of application parameters presents a way for an end customer to reduce the impact of interference. However, tuning the application to deal with interference is challenging for two fundamental reasons: the configuration depends on the nature and degree of interference, and there are inter-parameter dependencies. We design and implement the IC2 system to address the challenges of detection and mitigation of performance interference in clouds. Compared to an interference-agnostic configuration, the proposed solution provides up to 29% and 40% improvement in average response time on EC2 and a private cloud testbed, respectively.
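A minimal sketch of the detect-and-reconfigure loop described above, assuming a simple response-time inflation rule and two invented Apache-style configuration profiles; the real IC2 also models inter-parameter dependencies, which this sketch omits.

```python
# Illustrative interference-aware reconfiguration in the spirit of IC2.
from statistics import mean

BASELINE_CONFIG = {"MaxClients": 256, "KeepAliveTimeout": 5}
INTERFERENCE_CONFIG = {"MaxClients": 128, "KeepAliveTimeout": 2}  # shed load under contention

def detect_interference(recent_rt_ms, baseline_rt_ms, inflation=1.5):
    """Flag interference when recent average response time exceeds baseline by 50%."""
    return mean(recent_rt_ms) > inflation * baseline_rt_ms

def reconfigure(recent_rt_ms, baseline_rt_ms):
    if detect_interference(recent_rt_ms, baseline_rt_ms):
        return INTERFERENCE_CONFIG
    return BASELINE_CONFIG

print(reconfigure([180, 210, 195], baseline_rt_ms=100.0))  # interference profile
```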
-
" Orion: Scaling Genomic Sequence Matching with
Fine-Grained Parallelization,"
Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. Accepted to appear at the International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing), pp. 1-11, Nov 16-21, 2014. (Acceptance rate: 82/394 = 20.8%)
[ Presentation ] [ Abstract ]
Gene sequencing instruments are producing huge volumes of data, straining the capabilities of current database searching algorithms and hindering efforts of researchers analyzing larger collections of data to obtain greater insights. In the space of parallel genomic sequence search, most popular software packages, like mpiBLAST, use the database segmentation approach, wherein the entire database is sharded and searched on different nodes. However, this approach does not scale well with the increasing length of individual query sequences or with the rapid growth in the size of sequence databases. In this paper, we propose a fine-grained parallelism technique, called Orion, that divides the input query into an adaptive number of fragments and shards the database. Our technique achieves higher parallelism (and hence speedup) and load balancing, while maintaining 100% accuracy. We show that it is 12.3X faster than mpiBLAST for solving a relevant comparative genomics problem.
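The fragmentation idea can be illustrated with a short sketch: split a long query into an adaptive number of overlapping fragments so each (fragment, database shard) pair can be searched independently. Fragment length and overlap below are invented parameters, not Orion's tuned values.

```python
# Sketch of adaptive, overlapping query fragmentation for parallel search.
def fragment_query(query, target_len=1000, overlap=100):
    """Return overlapping fragments; overlap avoids missing matches that span a cut."""
    if len(query) <= target_len:
        return [query]
    step = target_len - overlap
    return [query[i:i + target_len] for i in range(0, len(query) - overlap, step)]

query = "ACGT" * 800                      # a 3,200-base query
frags = fragment_query(query)
print(len(frags), [len(f) for f in frags])  # 4 fragments, each searchable in parallel
```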
-
" Is Your Web Server Suffering from Undue Stress due to Duplicate Requests?,"
Fahad A. Arshad, Amiya K. Maji, Sidharth Mudgal, and Saurabh Bagchi. Accepted as a short paper at the 11th International Conference on Autonomic Computing (ICAC), pp. 105-111, June 18-20, 2014, Philadelphia, PA. (Acceptance rate: (12 full + 10 short papers)/53 = 41.5%)
[ Presentation ] [ Abstract ]
An important, if not very well known, problem that afflicts many web servers is duplicate client browser requests due to server-side problems. A legitimate request is followed by a redundant request, thus increasing the load on the server and corrupting state at the server end (such as the hit count for the page) and at the client end (such as state maintained through a cookie). This problem has been reported in many developer blogs and has been found to afflict even popular web sites, such as CNN and YouTube. However, to date, there has not been a scientific, technical solution to this problem that is browser-vendor neutral. In this paper, we provide such a solution, which we call GRIFFIN. We identify that the two root causes of the problem are a missing resource at the server end or duplicated JavaScript embedded in the page. We have the insight that dynamic tracing of the function call sequence creates a signature that can be used to differentiate between legitimate and duplicate requests. We apply our technique to find unreported problems in a large production scientific collaboration web service called HUBzero, which were fixed once we reported them. Our experiments show an average overhead of 1.29X for tracing the PHP runtime on HUBzero across 60 unique HTTP transactions. GRIFFIN has zero false positives (when run across HTTP transactions of sizes one and two) and an average detection accuracy of 78% across 60 HTTP transactions.
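As we read the abstract, the core mechanism is a signature over the traced call sequence. The sketch below is a hypothetical reconstruction of that idea; the window length, trace format, and duplicate rule are our assumptions, not GRIFFIN's implementation.

```python
# Hypothetical duplicate-request detection via call-sequence signatures.
import hashlib
import time

def signature(call_sequence):
    """Hash the dynamically traced function-call sequence of one request."""
    return hashlib.sha1("->".join(call_sequence).encode()).hexdigest()

class DuplicateDetector:
    def __init__(self, window_s=2.0):
        self.window_s = window_s
        self.last_seen = {}  # (client, signature) -> timestamp of last occurrence

    def is_duplicate(self, client, call_sequence, now=None):
        now = time.time() if now is None else now
        key = (client, signature(call_sequence))
        dup = now - self.last_seen.get(key, float("-inf")) < self.window_s
        self.last_seen[key] = now
        return dup

d = DuplicateDetector()
trace = ["index.php", "render_page", "load_widget"]
print(d.is_duplicate("10.0.0.1", trace, now=0.0))  # False: first request
print(d.is_duplicate("10.0.0.1", trace, now=0.5))  # True: same signature 0.5 s later
```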
-
" Diagnosis of Performance
Faults in Large Scale MPI Applications via Probabilistic
Progress-Dependence Inference," Ignacio Laguna (LLNL), Dong Ahn (LLNL), Bronis de Supinski (LLNL),
Saurabh Bagchi, and Todd Gamblin (LLNL), Accepted to appear in IEEE Transactions
on Parallel and Distributed Systems (TPDS), pp. 1-15, notification of
acceptance: March 2014. [ Abstract ]
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution, and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents tasks' control-flow behavior. In comparison to previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis, further isolating the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, which only manifests itself at 7,996 tasks or more. We extensively evaluate fault coverage of our technique via fault injections in ten HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
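A toy reconstruction of the progress-dependence idea, using plain reachability in an invented control-flow graph in place of the paper's probabilistic inference over a Markov model: a task is "least progressed" if every other task's state is still reachable from its own state.

```python
# Toy sketch of least-progressed-task inference on a shared control-flow graph.
edges = {"init": ["loop"], "loop": ["loop", "reduce"], "reduce": ["finish"], "finish": []}

def reachable(src, dst, graph):
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return False

def least_progressed(task_states, graph):
    """Tasks whose state can still reach every other task's state."""
    return [t for t, s in task_states.items()
            if all(reachable(s, s2, graph) for t2, s2 in task_states.items() if t2 != t)]

tasks = {0: "reduce", 1: "loop", 2: "reduce"}
print(least_progressed(tasks, edges))  # [1]: task 1 is still stuck in the loop
```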
-
" Accurate Application Progress Analysis for Large-Scale Parallel Debugging,"
Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, and Todd Gamblin. At the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 193-203, Edinburgh, UK, June 9-11, 2014. (Acceptance rate:
52/287 = 18.1%) [ Presentation ] [ Abstract ]
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can propagate quickly to other tasks, making it difficult to debug. Fortunately, finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches to detecting them suffer from low accuracy and large overheads; either they use imprecise static analysis or are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, called PRODOMETER, which determines relative progress among parallel tasks using only dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.
-
“ Characterizing Configuration Problems in Java EE
Application Servers: An Empirical Study with GlassFish and JBoss,”
Fahad A. Arshad, Rebecca J. Krause, and Saurabh Bagchi, At the 24th IEEE International Symposium on Software Reliability Engineering (ISSRE), pp. 198-207, Pasadena, CA, November 4-7, 2013. (Acceptance rate: 46/131 = 35.1%) [ Presentation ] [ Abstract ]
We present a characterization study on configuration problems for Java EE application servers. Our study analyzes a total of 281 bug reports in two phases, a longer (phase-1) and a shorter (phase-2) phase, from the bug tracking systems of two popular open source servers, GlassFish and JBoss. We study configuration problems in four orthogonal dimensions: problem-type, problem-time, problem-manifestation, and problem-culprit. A configuration problem, by type, is classified as a parameter, compatibility, or missing-component problem. Problem-time is classified as pre-boot-time, boot-time, or run-time. A configuration problem manifestation is either silent or non-silent. The problem-culprit is either the user or the developer of the application server. Our analysis shows that more than one-third of all problems in each server are configuration problems. Among all configuration problems for each server in phase-1, at least 50% of the problems are parameter-based and occur at run-time. In phase-2, which focuses on specific versions over a shorter time period, all three problem types (parameter, compatibility, and missing-component) have an almost equal share. Further, on average 89% of configuration problems result in a non-silent manifestation, while 91% of them are due to mistakes by the developer of the application server and require code modification to fix. Finally, we test robustness to configuration errors by injecting configuration bugs at boot-time with the SPECjEnterprise2010 application deployed in each server. JBoss performs better than GlassFish, with all of the injections giving a non-silent manifestation, as opposed to only 65% non-silent manifestations in GlassFish.
-
“ Automatic Problem Localization in Distributed Applications via Multi-dimensional Metric Profiling,”
Ignacio Laguna, Subrata Mitra, Fahad A. Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P. Midkiff, Mike Kistler (IBM Research), and Ahmed Gheith (IBM Research), At the 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 121-132, Braga, Portugal, September 30-October 3, 2013. (Acceptance rate: 22/67 = 32.8%) [ Presentation ] [ Abstract ]
Debugging today's large-scale distributed applications is complex. Traditional debugging techniques such as breakpoint-based debugging and performance profiling require a substantial amount of domain knowledge and do not automate the process of locating bugs and performance anomalies. We present ORION, a framework to automate the problem-localization process in distributed applications. From a large set of metrics, ORION intelligently chooses important metrics and models the application's runtime behavior through pairwise correlations of those metrics in the system, within multiple non-overlapping time windows. When correlations deviate from those of a learned correct model due to a bug, our analysis pinpoints the metrics and code regions (class and method within it) that are most likely associated with the failure. We demonstrate our framework with several real-world failure cases in distributed applications such as HBase, Hadoop DFS, a Purdue campus-wide Java application for checking the availability of lab machines, and a regression testing framework from IBM. Our results show that ORION is able to pinpoint the metrics and code regions that developers need to concentrate on to fix the failures.
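A rough sketch of the correlation-deviation idea using plain Pearson correlations over synthetic metrics: learn pairwise correlations from a correct run, then rank metrics by how far their correlations drifted in the faulty run. Metric names and data are invented; ORION's actual metric selection and windowing are more involved.

```python
# Sketch of ORION-style metric localization via pairwise correlation drift.
import numpy as np

def pairwise_corr(window):                 # window: samples x metrics
    return np.corrcoef(window, rowvar=False)

def suspect_metrics(train, faulty, names):
    drift = np.abs(pairwise_corr(train) - pairwise_corr(faulty))
    scores = np.nansum(drift, axis=1)      # total correlation drift per metric
    return sorted(zip(names, scores), key=lambda p: -p[1])

rng = np.random.default_rng(0)
cpu = rng.normal(50, 5, 200)
net = cpu * 0.8 + rng.normal(0, 1, 200)    # normally, network tracks CPU
train = np.column_stack([cpu, net, rng.normal(10, 2, 200)])
faulty = np.column_stack([rng.normal(50, 5, 200),   # cpu-net correlation broken
                          rng.normal(40, 5, 200),
                          rng.normal(10, 2, 200)])
print(suspect_metrics(train, faulty, ["cpu", "net_bytes", "disk_io"]))
```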
-
“ WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales,” Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi. At the 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC), pp. 1-12, New York City, New York, June 17-21, 2013. (Acceptance rate: 20/131 = 15.3%)
[ Presentation ] [ Abstract ]
A key challenge in developing large scale applications is finding bugs that are latent at the small scales of testing, but manifest themselves when the application is deployed at a large scale. Here, we ascribe a dual meaning to "large scale": it could mean a large number of executing processes or applications ingesting large amounts of input data (or both). Traditional statistical techniques fail to detect or diagnose such kinds of bugs because no error-free run is available at the large deployment scales for training purposes. Prior work used scaling models to detect anomalous behavior at large scales without training on correct behavior at that scale. However, that work cannot localize bugs automatically, i.e., cannot pinpoint the region of code responsible for the error. In this paper, we resolve that shortcoming by making the following three contributions: (i) we develop an automatic diagnosis technique, based on feature reconstruction; (ii) we design a heuristic to effectively prune the large feature space; and (iii) we demonstrate that our system scales well, in terms of both accuracy and overhead. We validate our design through a large-scale fault-injection study and two case studies of real-world bugs, finding that our system can effectively localize bugs in 92.5% of the cases, dramatically easing the challenge of finding bugs in large-scale programs.
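A simplified sketch of scale-based feature reconstruction: fit a per-feature model of feature value versus scale from small-scale training runs, predict each feature at the deployment scale, and rank features by relative reconstruction error. The linear model, feature names, and numbers are illustrative assumptions; WuKong additionally prunes the feature space.

```python
# Sketch of feature reconstruction for bug localization at scale.
import numpy as np

train_scales = np.array([4, 8, 16, 32])               # processes in training runs
train_feats = {"loop_iters": np.array([400, 800, 1600, 3200]),
               "msgs_sent":  np.array([12, 28, 60, 124])}

def predict_at_scale(scales, values, scale):
    slope, intercept = np.polyfit(scales, values, 1)  # per-feature scaling model
    return slope * scale + intercept

def rank_suspects(observed, scale):
    errs = {}
    for name, vals in train_feats.items():
        expected = predict_at_scale(train_scales, vals, scale)
        errs[name] = abs(observed[name] - expected) / max(abs(expected), 1e-9)
    return sorted(errs.items(), key=lambda p: -p[1])

observed_at_1024 = {"loop_iters": 102400, "msgs_sent": 9000}  # msgs_sent is off-model
print(rank_suspects(observed_at_1024, scale=1024))
```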
-
“ mcrEngine: A Scalable
Checkpointing System using Data-Aware Aggregation and Compression,”
Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis
R. de Supinski, and Rudolf Eigenmann. At the IEEE/ACM International Conference for High
Performance Computing, Networking, Storage, and Analysis
(Supercomputing), pp. 1-10, Salt Lake City, Utah, November 10-16, 2012.
(Acceptance rate: 100/472 = 21.2%) (One of 8 finalists for the Best Student Paper award)
[ Presentation ] [ Abstract ]
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure.
We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
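A conceptual sketch of why data-aware aggregation helps: grouping like-named variables across process checkpoints puts similar data next to each other, within the compressor's window. Dicts stand in for HDF5/netCDF files, and the sizes are contrived to make the effect visible; this illustrates the idea, not the mcrEngine implementation.

```python
# Sketch: data-aware merge vs. naive concatenation before compression.
import random
import zlib

random.seed(0)
shared = bytes(random.randrange(256) for _ in range(20_000))   # same in every process
ckpts = [{"pressure": shared,
          "temperature": bytes(random.randrange(256) for _ in range(20_000))}
         for _ in range(8)]

def naive_blob(cks):   # process 0's variables, then process 1's, ...
    return b"".join(b"".join(c[v] for v in sorted(c)) for c in cks)

def aware_blob(cks):   # all "pressure" blocks together, then all "temperature"
    return b"".join(b"".join(c[v] for c in cks) for v in sorted(cks[0]))

print("concatenate-then-compress:", len(zlib.compress(naive_blob(ckpts))))
print("data-aware merge:         ", len(zlib.compress(aware_blob(ckpts))))
```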
-
" ABHRANTA: Locating Bugs that Manifest at Large System Scales,"
Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. At the 8th Workshop on Hot Topics in System Dependability (HotDep) (co-located with OSDI '12), pp. 1-6, Hollywood, CA, October 7, 2012. (Acceptance rate:
10/24 = 41.7%) [ Presentation ] [ Abstract ]
A key challenge in developing large scale applications
(both in system size and in input size) is finding bugs that
are latent at the small scales of testing, only manifesting
when a program is deployed at large scales. Traditional
statistical techniques fail because no error-free run
is available at deployment scales for training purposes.
Prior work used scaling models to detect anomalous behavior
at large scales without being trained on correct
behavior at that scale. However, that work cannot localize
bugs automatically. In this paper, we extend that
work with an automatic diagnosis technique, based on feature
reconstruction, and validate our design through case
studies with two real bugs from an MPI library and a
DHT-based file sharing application.
-
“ Probabilistic Diagnosis of Performance Faults in Large Scale Parallel Applications,” Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski,
Saurabh Bagchi, and Todd Gamblin.
At the 21st International Conference on Parallel
Architectures and Compilation Techniques (PACT), pp. 1-10, September
19-23, 2012, Minneapolis, MN. (Acceptance rate: 39/207 = 18.8%) [ Presentation ] [ Abstract ]
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control, but they provide little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. We have developed a novel, highly scalable tool to help developers understand and fix correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
- "Automatic Fault Characterization via Abnormality-Enhanced Classification," Greg Bronevetsky (LLNL), Ignacio Laguna, Saurabh Bagchi and Bronis R. de Supinski (LLNL). In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 8 pages, Boston, MA, June 25-28, 2012 [ Presentation ] [ Abstract ]
Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks [1]. As these machines grow in scale, faults become more frequent and system complexity makes it difficult to detect and diagnose them. The difficulty is particularly large for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high since they significantly reduce system productivity, both immediately and through the time required to resolve them. Current system management techniques [2], [3] do not work well since they require manual examination of system behavior and do not identify root causes.
When a fault is manifested, system administrators need timely notification about the type of fault, the time period in which it occurred, and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior [4]. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.
-
" A Study of Soft Error Consequences in Hard Disk Drives," Timothy Tsai (Hitachi GST), Nawanol Theera-Ampornpunt and Saurabh Bagchi. In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 8 pages, Boston, MA, June 25-28, 2012 [ Presentation ] [ Abstract ]
Hard disk drives have multiple layers of fault tolerance mechanisms that protect against data loss. However, a few failures occasionally breach the entire set of mechanisms. To prevent such scenarios, we rely on failure prediction mechanisms to raise alarms with sufficient warning to allow the at-risk data to be copied to a safe location. A common failure prediction technique monitors the occurrence of soft errors and triggers an alarm when the soft error rate exceeds a specified threshold. This study uses data collected from over 50,000 customer deployed disk drives to evaluate the performance of a failure prediction algorithm that relies solely on soft errors to predict failures manifested as hard errors. The data analysis shows that soft errors alone cannot be used as a reliable predictor of hard errors. However, in those cases where soft errors do accurately predict hard errors, sufficient warning time exists for preventive actions.
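A back-of-the-envelope sketch of the class of predictor the study evaluates: raise a drive-failure alarm when the soft-error count in a sliding window crosses a threshold. The window and threshold are invented, and the paper's finding is precisely that such soft-error-only predictors are unreliable on their own.

```python
# Sketch of a threshold-based soft-error failure predictor.
from collections import deque

class SoftErrorPredictor:
    def __init__(self, window=7, threshold=5):
        self.window = deque(maxlen=window)   # daily soft-error counts
        self.threshold = threshold

    def observe_day(self, soft_errors):
        self.window.append(soft_errors)
        return sum(self.window) >= self.threshold  # True -> raise an alarm

p = SoftErrorPredictor()
for day, count in enumerate([0, 0, 1, 0, 2, 3, 0]):
    if p.observe_day(count):
        print(f"alarm on day {day}: copy at-risk data now")  # fires on day 5
```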
-
" Large Scale Debugging of Parallel Tasks with AutomaDeD," Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, and Barry Rountree, At the Supercomputing Conference, 12 pages, Seattle, WA, Nov 12-18, 2011. (Acceptance rate: 74/352 = 21.0%) [ Presentation ] [ Abstract ]
Developing correct HPC applications continues to be a challenge as the number of cores increases in today's largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. The overhead of collecting large amounts of runtime information and an absence of scalable error detection algorithms generally cause poor scalability. In this work, we present novel, highly efficient techniques that facilitate the process of debugging large scale parallel applications. Our approach extends our previous work, AutomaDeD, in three major areas to isolate anomalous tasks in a scalable manner: (i) we efficiently compare elements of graph models (used in AutomaDeD to model parallel tasks) using pre-computed lookup tables and pointer comparison; (ii) we compress per-task graph models before the error detection analysis so that comparison between models involves many fewer elements; (iii) we use scalable sampling-based clustering and nearest-neighbor techniques to isolate abnormal tasks when bugs and performance anomalies are manifested. Our evaluation with fault injections shows that AutomaDeD scales well to thousands of tasks and that it can find anomalous tasks in under 5 seconds in an online manner.
-
"Dangers and Joys of Stock Trading on the Web: Failure Characterization of a Three-Tier Web Service," Fahad Arshad and Saurabh Bagchi, At the 30th IEEE Symposium on Reliable Distributed Systems (SRDS), 10 pages, Madrid, Spain, Oct 5-7, 2011. (Acceptance rate: 30/88 = 34.1%)
[ Presentation ] [ Abstract ]
Characterizing latent software faults is crucial to address dependability issues of current three-tier systems. A client should not have a misconception that a transaction succeeded when, in reality, it failed due to a silent error. We present a fault injection-based evaluation to characterize silent and non-silent software failures in a representative three-tier web service, one that mimics a day trading application widely used for benchmarking application servers. For failure characterization, we quantify the distribution of silent and non-silent failures, and recommend low-cost application-generic and application-specific consistency checks, which improve the reliability of the application. We inject three variants of null-call, where a callee returns null to the caller without executing business logic. Additionally, we inject three types of unchecked exceptions and analyze the reaction of our application. Our results show that 49% of error injections from null-calls result in silent failures, while 34% of unchecked exceptions result in silent failures. Our generic consistency check can detect silent failures in null-calls with an accuracy as high as 100%. Non-silent failures with unchecked exceptions can be detected with an accuracy of 42% with our application-specific checks.
-
“ Vrisha: Using Scaling Properties of Parallel Programs for Bug Detection and Localization,” Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi, At the 20th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 12 pages, San Jose, California, June 8-11, 2011. (Acceptance rate: 22/170 = 12.9%) [ Presentation ] [ Abstract ]
Detecting and isolating bugs that arise in parallel programs is a tedious and a challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs that appear in popular MPI libraries, and show that our techniques can be implemented with low overhead and low false-positive rates.
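A loose sketch of the detection idea, substituting a simple log-log regression for the paper's KCCA: learn how a behavioral feature scales with program size from bug-free small-scale runs, then flag a large-scale run whose behavior falls far outside the extrapolated trend. The feature, data, and 3-sigma band are illustrative assumptions.

```python
# Sketch of scale-extrapolated anomaly detection (stand-in for KCCA).
import numpy as np

scales = np.array([2, 4, 8, 16, 32])                       # bug-free training runs
bytes_recv = np.array([1.0e6, 2.1e6, 3.9e6, 8.2e6, 16.5e6])

coef = np.polyfit(np.log(scales), np.log(bytes_recv), 1)   # log-log scaling model
resid = np.log(bytes_recv) - np.polyval(coef, np.log(scales))
band = 3 * resid.std()                                     # tolerance around the trend

def is_anomalous(scale, observed):
    expected = np.polyval(coef, np.log(scale))
    return abs(np.log(observed) - expected) > band

print(is_anomalous(1024, 5.3e8))  # roughly on-trend -> False
print(is_anomalous(1024, 4.0e9))  # an order of magnitude off -> True
```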
-
" Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian": Amiya Kumar Maji, Kangli Hao, Salmin Sultana, and Saurabh Bagchi. At the 21st annual International Symposium on Software Reliability Engineering (ISSRE 2010), 10 pages, Nov 1-4, 2010, San Jose, California. (Acceptance rate: 40/130 = 30.8%) [ Abstract ]
As smart phones grow in popularity, manufacturers are in a race to pack an increasingly rich set of features into these tiny devices. This brings additional complexity in the system software that has to fit within the constraints of the devices (chiefly memory, stable storage, and power consumption) and hence, new bugs are revealed. How this evolution of smartphones impacts their reliability is a question that has been largely unexplored until now. With the release of open source OSes for hand-held devices, such as Android (open sourced in October 2008) and Symbian (open sourced in February 2010), we are now in a position to explore the above question. In this paper, we analyze the reported cases of failures of Android and Symbian based on bug reports posted by third-party developers and end users and documentation of bug fixes from Android developers. First, based on 628 developer reports, our study looks into the manifestation of failures in different modules of Android and their characteristics, such as their transience in time. Next, we analyze similar properties of Symbian bugs based on 153 failure reports. Our study indicates that development tools, web browsers, and multimedia applications are the most error-prone in both these systems. We further analyze 233 bug fixes for Android and categorize the different types of code modifications required for the fixes. The analysis shows that 78% of errors required minor code changes, with the largest share of these coming from modifications to attribute values and conditions. Our final analysis focuses on the relation between customizability, code complexity, and reliability in Android and Symbian. We find that despite high cyclomatic complexity, the bug densities in Android and Symbian are surprisingly low. However, the support for customizability does impact the reliability of mobile OSes, and there are cautionary tales for their further development.
-
" AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks": Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz. In the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 10 pages, June 28-July 1, 2010, Chicago, IL. (Acceptance rate (DCCS track): 40/174 = 23%) [ Presentation ] [ Abstract ]
Today's largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region and specific program execution point. AutomaDeD statistically models the application's control-flow and timing behavior, grouping tasks and identifying deviations from normal execution, which significantly reduces debugging effort. In addition to a case study in which AutomaDeD locates a bug that occurred during development of MVAPICH, we evaluate AutomaDeD on a range of bugs injected into the NAS parallel benchmarks. Our results demonstrate that AutomaDeD detects the time period when a bug first manifested with 90% accuracy for stalls and hangs and 70% accuracy for interference faults. It identifies the subset of processes first affected by the fault with 80% and 70% accuracy, respectively, and the code region where the fault first manifested with 90% and 50% accuracy, respectively.
-
" How To Keep Your Head Above Water While Detecting Errors": Ignacio Laguna, Fahad A. Arshad, David M. Grothe, and Saurabh Bagchi. In: ACM/IFIP/USENIX 10th International Middleware Conference, November 30-December 4, 2009, Urbana-Champaign, Illinois. (Acceptance rate: 21/110 = 19.1%) [ Presentation ] [ abstract ]
Today's distributed systems need runtime error detection to catch errors arising from software bugs, hardware errors, or unexpected operating conditions. A prominent class of error detection techniques operates in a stateful manner, i.e., it keeps track of the state of the application being monitored and then matches state-based rules. Large-scale distributed applications generate a high volume of messages that can overwhelm the capacity of a stateful detection system. An existing approach to handle this is to randomly sample the messages and process a subset. However, this approach leads to non-determinism with respect to the detection system's view of what state the application is in, which in turn degrades the quality of detection. We present an intelligent sampling algorithm and a Hidden Markov Model (HMM)-based algorithm to select the messages that the detection system processes and determine the application states such that the non-determinism is minimized. We also present a mechanism for selectively triggering computationally intensive rules, based on a light-weight mechanism that determines whether the rule is likely to be flagged. We demonstrate the techniques in a detection system called Monitor applied to a J2EE multi-tier application. We empirically evaluate the performance of Monitor under different load conditions and error scenarios and compare it to a previous system called Pinpoint.
-
" FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments": Tanzima Zerin, Saurabh Bagchi, and Rudolf Eigenmann. In: the ACM/IEEE Supercomputing Conference, November 14-20, 2009, Portland, Oregon. (Acceptance rate: 59/261 = 22.6%) (Nominated as one of 4 best student papers) [ Presentation ] [ Abstract ]
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For guest users, free resources come at the cost of unpredictable “failures”, where failures are defined as disruption in the guest job’s execution due to contention from the processes of the machine owner or the conventionally understood hardware and software failures. These unpredictable failures lead to unpredictable completion times. Checkpoint-recovery has long been used for providing reliability in failure-prone computing environments. Today’s production FGCS systems, such as Condor, use expensive, high-performance
dedicated checkpoint servers, even though they could take advantage of free disk resources offered by the clusters’ commodity
machines. Also, in large, geographically distributed clusters, dedicated checkpoint servers may incur high checkpoint
transfer latencies. In this paper we consider using available, free disk resources as shared storage hosts for serving
as checkpoint repositories. Doing so raises new challenges in providing fault-tolerance, because a failing storage host may lead to a loss of saved application states. We model failures of such shared storage hosts, develop an algorithm that predicts these failures, and then choose an appropriate set of storage hosts. We describe the development of our system, called Falcon, in the production university-wide Condor testbed at Purdue University, named “BoilerGrid”. Through experiments on BoilerGrid, we show that Falcon provides improved and consistent performance to guest jobs by ensuring reliability and robustness in the presence of irregular resource availability.
-
" Stateful Detection in High Throughput Distributed Systems": Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 275-287, Beijing, CHINA, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
With the increasing speed of computers and the complexity of applications, many of today’s distributed systems exchange data at a high rate. Significant work has been done on error detection achieved through external fault tolerance systems. However, the high data rate coupled with complex detection can cause the capacity of the fault tolerance system to be exhausted, resulting in low detection accuracy. We present a new stateful detection mechanism which observes the exchanged application messages, deduces the application state, and matches against anomaly-based rules. We extend our previous framework (the Monitor) to incorporate a sampling approach which adjusts the rate of verified messages. The sampling approach avoids the previously reported breakdown in the Monitor capacity at high application message rates, reduces the overall detection cost, and allows the Monitor to provide accurate detection. We apply the approach to a reliable multicast protocol (TRAM) and demonstrate its performance by comparing it with our previous framework.
-
" Distributed Diagnosis of Failures in a Three Tier E-Commerce System": Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 185-198, Beijing, CHINA, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
For dependability outages in distributed internet infrastructures, it is often not enough to detect a failure; it is also required to diagnose it, i.e., to identify its source. Complex applications deployed in multi-tier environments make diagnosis challenging because of fast error propagation, black-box applications, high diagnosis delay, the amount of state that must be maintained, and imperfect diagnostic tests. Here, we propose a probabilistic diagnosis model for arbitrary failures in components of a distributed application. The monitoring system (the Monitor) passively observes the message exchanges between the components and, at runtime, performs a probabilistic diagnosis of the component that was the root cause of a failure. We demonstrate the approach by applying it to the Pet Store J2EE application, and we compare it with Pinpoint by quantifying latency and accuracy in both systems. The Monitor outperforms Pinpoint by achieving comparably accurate diagnosis with higher precision in shorter time.
-
" Automated Rule-Based Diagnosis through a Distributed Monitor System": Gunjan Khanna, Mike Yu Cheng, Padma Varadharajan, Saurabh Bagchi, Miguel P. Correia, and Paulo J. Verissimo. In: IEEE Transactions on Dependable and Secure Computing (TDSC), pp. 266-279, notificacion of acceptance: May 2007. [ abstract ]
In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
-
" Failure-Aware Checkpointing in Fine-Grained Cycle Sharing Systems": Xiaojuan Ren, Rudolf Eigenmann, and Saurabh Bagchi. In: 16th IEEE International Symposium on High Performance Distributed Computing (HPDC-16), Monterey Bay, California, June 27-29, 2007. (Acceptance rate: 20%). [ Presentation ] [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the local users of the host. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply the knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present the schemes of selecting reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems to optimize the network overhead of transferring checkpoints and the work lost due to unavailability of a storage host when needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately with the cost of delaying that to a later time, which is a function of the resource availability. We evaluate the failure-aware techniques on an FGCS system called iShare, using trace-based simulation. The results show that our techniques achieve better application performance than prevalent methods that use checkpointing with a fixed periodicity on dedicated checkpoint servers.
-
" Prediction of Resource Availability in Fine-Grained Cycle Sharing Systems and Empirical Evaluation", Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. In: Springer's Journal of Grid Computing (JOGC), vol. 5, no. 2, pp. 173-195, notification of acceptance: February 2007. [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of computational resources available on the Internet. In FGCS, host computers allow guest jobs to utilize the CPU cycles if the jobs do not significantly impact the local users. Such resources are generally provided voluntarily and their availability fluctuates highly. Guest jobs may fail unexpectedly, as resources become unavailable. To improve this situation, we consider methods to predict resource availability. This paper presents empirical studies on resource availability in FGCS systems and a prediction method. From studies on resource contention among guest jobs and local users, we derive a multi-state availability model. The model enables us to detect resource unavailability in a non-intrusive way. We analyzed the traces collected from a production FGCS system for three months. The results suggest the feasibility of predicting resource availability, and motivate our method of applying semi-Markov Process models for the prediction. We describe the prediction framework and its implementation in a production FGCS system, named iShare. Through the experiments on an iShare testbed, we demonstrate that the prediction achieves an accuracy of 86% on average and outperforms linear time series models, while the computational cost is negligible. Our experimental results also show that the prediction is robust in the presence of irregular resource availability. We tested the effectiveness of the prediction in a proactive scheduler. Initial results show that applying availability prediction to job scheduling reduces the number of jobs failed due to resource unavailability.
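A simplified sketch of the availability-prediction idea, collapsed to a discrete-time Markov chain over fixed steps (the paper uses a semi-Markov process with state holding times); states and transition probabilities are invented for illustration.

```python
# Sketch: probability a host stays available over a horizon, via a Markov chain.
import numpy as np

states = ["free", "user_active", "unavailable"]
P = np.array([[0.90, 0.08, 0.02],   # transition probabilities per 5-minute step
              [0.30, 0.60, 0.10],
              [0.05, 0.05, 0.90]])

def prob_no_eviction(start_state, steps):
    """P(never entering 'unavailable' within `steps`), by censoring that state."""
    Q = P[:2, :2]                   # restrict to the two 'up' states
    dist = np.zeros(2)
    dist[states.index(start_state)] = 1.0
    for _ in range(steps):
        dist = dist @ Q
    return dist.sum()

print(prob_no_eviction("free", steps=12))  # availability over the next hour
```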
-
" Pesticide: Using SMT Processors to Improve Performance of Pointer Bug Detection", Jin-Yi Wang, Yen-Shiang Shue, T N Vijaykumar, and Saurabh Bagchi. 24th International Conference of Computer Design (ICCD), Oct 1-4, 2006, San Jose, California, USA. [ abstract ]
Pointer bugs associated with dynamically-allocated objects resulting in out-of-bounds memory access are an important class of software bugs. Because such bugs cannot be detected easily via static-checking techniques, dynamic monitoring schemes have been proposed. However, the key challenge with dynamic monitoring schemes is the runtime overhead (slowdowns of the order of 10x are common). Previous approaches have used thread-level speculation (TLS) to reduce the overhead. However, the approaches still incur substantial slowdowns while requiring complex TLS hardware. We make the key observation that because the monitor code and user code are largely and unambiguously independent, TLS hardware with all its complexity to handle speculative parallelism is unnecessary. We explicitly multithread the monitor code in which a thread checks one access and use SMT to exploit the parallelism in the monitor code. Despite multithreading the monitor code on SMT, dynamic monitoring slows down the user thread due to two problems: instruction overhead and insufficient overlap among the monitor threads. To address instruction overhead, we exploit the natural locality in the user thread addresses and memoize recent checks in a small table called the allocation-record-cache (ARC). However, programs making and accessing many small memory allocations cause many ARC misses and incur significant runtime overhead. To address this issue, we make a second key observation that because adjacent memory objects result in ARC entries with contiguous address ranges, the entries can be merged into one by simply merging the ranges into one. This merging increases the effective size of the ARC. Finally, insufficient overlap among monitor threads occurs because of inefficient synchronization to protect the allocation data structure updated by the user thread and read by the monitor threads. We make the third key observation that because monitor-thread reads occur for every check but user-thread writes occur only in allocations and deallocations, monitor reads are much more frequent than user
writes. We propose a locking strategy, called biased lock, which puts the locking overhead on the writer away from the readers. We show that starting from a runtime overhead of 414% our scheme reduces this overhead to a respectable 24% running three monitor threads on an SMT using a 256-entry ARC with merging and biased lock.
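A software sketch of the allocation-record-cache (ARC) idea with range merging: memoize recent bounds checks, and merge entries whose address ranges are contiguous so the small cache covers more memory. The capacity, addresses, and eviction policy are illustrative; the paper realizes this as a hardware table on an SMT core.

```python
# Sketch of an allocation-record cache with contiguous-range merging.
class AllocationRecordCache:
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.ranges = []                       # list of (lo, hi) valid ranges

    def insert(self, lo, hi):
        for i, (l, h) in enumerate(self.ranges):
            if lo == h or hi == l:             # contiguous: merge into one entry
                self.ranges[i] = (min(l, lo), max(h, hi))
                return
        if len(self.ranges) >= self.capacity:
            self.ranges.pop(0)                 # evict the oldest entry
        self.ranges.append((lo, hi))

    def check(self, addr):
        """True if the access falls inside some known allocation."""
        return any(l <= addr < h for l, h in self.ranges)

arc = AllocationRecordCache()
arc.insert(0x1000, 0x1100)
arc.insert(0x1100, 0x1200)                     # merges with the previous range
print(len(arc.ranges), arc.check(0x11F0), arc.check(0x1300))  # 1 True False
```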
-
" Providing Automated Detection of Problems in Virtualized Servers using Monitor framework", Gunjan Khanna, Saurabh Bagchi, Kirk Beaty, Andrzej Kochut, and Gautam Kar. Workshop on Applied Software Reliability (WASR) at the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2006, Philadelphia, Pennsylvania, USA. [ Presentation ] [ abstract ]
Increasing management costs for large servers have driven the adoption of a technique called server virtualization, whereby multiple virtual machines can be hosted on the same physical machine. Reducing the number of physical machines and the amount of rack space brings the total cost of ownership down. Virtual servers residing on the same physical machine are likely to suffer resource contention under varying workload conditions. We target distributed environments where some services are executed on virtual servers. Existing management frameworks fail to address the problems faced in such environments because of a narrow focus on system metrics, which are not sufficient to explain application problems; these frameworks also cannot detect problems arising out of virtualization. In this paper, we propose a hierarchical, scalable, non-intrusive Monitor framework which addresses not only system-level problems but also problems in application semantics. We show how the design of the Monitor framework differs from existing management frameworks, making it suitable for virtualized environments. We demonstrate the specific class of problems which the Monitor can detect in a virtualized environment running an e-commerce application.
-
" Resource Failure Prediction in Fine-Grained Cycle Sharing Systems", Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), 19-23 June 2006, Paris, France. (Acceptance rate: 24/157 ~ 15%). [ Presentation ] [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of computational resources available on the Internet. In FGCS, host computers allow guest jobs to utilize the CPU cycles if the jobs do not significantly impact the local host users. A characteristic of such resources is that they are generally provided voluntarily and their availability fluctuates highly. Guest jobs may incur resource failures because of unexpected resource unavailability. Checkpointing and migration techniques help overcome such failures. However, these techniques, if oblivious to future failures, may cause significant overhead and thus undesirable job response times. This paper presents a method to predict resource failures in FGCS systems. The prediction method enables proactive management with greatly improved job response times. It applies a semi-Markov Process and is based on a novel failure model, combining generic hardware-software failures with domain-specific failures in FGCS. We describe the failure prediction framework and its implementation in a production FGCS system named iShare. Through the experiments on an iShare testbed, we demonstrate that the prediction achieves accuracy above 86% on average and outperforms linear time series models, while the computational cost is negligible. Our experimental results also show that the prediction is robust in the presence of irregular resource failures.
-
" Automated Online Monitoring of Distributed Applications through External Monitors", Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 3, no. 2, pp. 115-129, Apr-Jun, 2006. [ abstract ]
It is a challenge to provide detection facilities for large-scale distributed systems running legacy code on hosts that may not allow fault tolerant functions to execute on them. It is tempting to structure the detection in an observer system that is kept separate from the observed system of protocol entities, with the former only having access to the latter’s external message exchanges. In this paper we propose an autonomous, self-checking Monitor system, which is used to provide fast detection for underlying network protocols. The Monitor architecture is application neutral and therefore lends itself to deployment for different protocols, with the rule base, against which the observed interactions are matched, making it specific to a protocol. To make the detection infrastructure scalable and dependable, we extend it to a hierarchical Monitor structure. The Monitor structure is made dynamic and reconfigurable by designing different interactions to cope with failures, load changes, or mobility. The latency of the Monitor system is evaluated under fault-free conditions, while its coverage is evaluated under simulated error injections.
-
" Probabilistic Diagnosis through Non-Intrusive Monitoring in Distributed Applications", Gunjan Khanna, Yu Cheng, Saurabh Bagchi, Miguel Correia, and Paolo Verissimo. Purdue ECE Technical Report 05-19, December 2005. [ abstract ]
With dependability outages in distributed critical infrastructures, it is often not enough to detect a failure; it is also required to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging because fast error propagation may occur in high throughput distributed applications. The diagnosis often needs to be probabilistic in nature due to imperfect observability of the payload system, inability to do white-box testing, constraints on the amount of state that can be maintained at the diagnostic process, and imperfect tests used to verify the system. In this paper, we extend an existing Monitor architecture for probabilistic diagnosis of failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal and aggregate graph between the PEs based on their communication and uses this together with a rule base for diagnosing the failure. The Monitor computes, for each suspected PE, a probability of the error having originated in that PE and propagated to the failure detection site. The framework is applied to a test-bed consisting of a reliable multicast protocol executing on the Purdue campus-wide network. Error injection experiments are performed to evaluate the accuracy and the performance overhead of the diagnostic process.
-
" Automated Monitor Based Diagnosis in Distributed Systems", Gunjan Khanna, Padma Varadharajan, Mike Cheng, and Saurabh Bagchi, Purdue ECE Technical Report 05-13, August 2005. [ abstract ]
In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnosis process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The hierarchical Monitor framework allows distributed diagnosis tolerating Byzantine failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
-
" LRRM: A Randomized Reliable Multicast Protocol for Optimizing Recovery Latency and Buffer Utilization", Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi. 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), October 26-28, 2005, Orlando, Florida, USA.(Acceptance rate: 20/67 ~ 29.9%) [ Camera ready ] [ abstract ]
An efficient recovery protocol for lost messages is crucial for supporting reliable multicasting. Tree-based recovery protocols group nodes into recovery regions and designate a recovery node per region for buffering and retransmitting lost messages. In these protocols, the recovery host may get overloaded during periods of large message losses, and costly remote recovery may be initiated even though a peer node has the lost message. To address these drawbacks, the Randomized Reliable Multicast Protocol (RRMP) was proposed, which distributes the responsibility of error recovery among all members in a group. The pressure on the buffer and computational resources of the intermediate nodes is increasing due to the wide distribution of multicast participants with widely varying reception rates and periodic disconnections. In this paper, we propose the Lightweight Randomized Reliable Multicast (LRRM) protocol that optimizes the amount of buffer space by providing an efficient mechanism based on best-effort multicast for retrieving a lost message. A theoretical analysis and a simulation-based study of two realistic topologies indicate that LRRM provides comparable recovery latency to RRMP for lower buffer space usage. While presented in the context of RRMP, LRRM can also benefit other tree-based reliable multicast protocols.
-
" Self Checking Network Protocols: A Monitor Based Approach", Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. 23rd International Symposium on Reliable Distributed Systems (SRDS 2004), October 2004. (Acceptance rate:27/117 ~ 23.1%)
[ Camera Ready ] [ Presentation ] [ abstract ]
The wide deployment of high-speed computer networks has made distributed systems ubiquitous in today's connected world. The machines on which the distributed applications are hosted are heterogeneous in nature, the applications often run legacy code without the availability of their source code, the systems are of very large scales, and often have soft real-time guarantees. In this paper, we target the problem of online detection of disruptions through a generic external entity called the Monitor that is able to observe the exchanged messages between the protocol participants and deduce any ongoing disruption by matching against a rule base composed of combinatorial and temporal rules. The Monitor architecture is application neutral, with the rule base making it specific to a protocol. To make the detection infrastructure scalable and dependable, we extend it to a hierarchical Monitor structure. The infrastructure is applied to a streaming video application running on a reliable multicast protocol called TRAM installed on the campus-wide network. The evaluation brings out the scalability of the Monitor infrastructure and the detection coverage under different kinds of faults for the single-level and hierarchical arrangements.
-
" Self-Checking Network Protocols: A Monitor Based Approach", Gunjan Khanna, MS Thesis. December 2003. [ abstract ]
The wide deployment of high-speed computer networks has made distributed systems ubiquitous in today's connected world. The systems are affected by disruptions, i.e., errors within the protocol or intrusions. This motivates the need for building distributed systems that are capable of tolerating disruptions and providing highly available and correctly functioning services. The machines on which the applications are hosted are heterogeneous in nature, the applications often run legacy code without the availability of their source code, the systems are of very large scales (of the order of tens of thousands of protocol participants), and the systems often have soft real-time guarantees. While it may be possible to devise very optimized and targeted solutions for individual distributed applications, such approaches are not very interesting from a research standpoint due to their limited applicability. In developing this thesis we have focused on Monitor-based detection of disruptions in a distributed environment. The Monitor detects disruptions by looking only at the external message exchanges, without looking at the internal transitions of the monitored entity. It runs asynchronously to the application, thus avoiding a performance bottleneck. We have chosen a black-box Monitor approach suitable for any generic protocol. In developing the Monitor-based detection approach, our aim is to provide higher reliability and dependability. We propose a hierarchical monitoring approach by placing a hierarchy of Local and Global Monitors in the system. A Local Monitor only monitors a set of local nodes, while a Global Monitor can have several Local Monitors reporting local interactions to it. This provides increased coverage and accuracy of detection. The Monitor consists of a Rule Classifier, Data Capture, and Matching Engine as the main components. The rules are classified into local and global rules intelligently by the Rule Classifier. The Matching Engine consists of fast matching algorithms, one each for temporal and combinatorial rules. Testing of the Monitor is done on a distributed reliable multicast protocol called TRAM. The Monitor is tested by injecting faults into the running protocol using a fault injector.
-
" Failure Handling in a Reliable Multicast Protocol for Improving Buffer Utilization and Accommodating Heterogeneous Receivers", Gunjan Khanna, John Rogers, and Saurabh Bagchi. In Proceedings of the 10th IEEE Pacific Rim Dependable Computing Conference (PRDC' 04), March 2004. (Acceptance rate: 34/102 ~ 33.3%) [ Camera ready ] [ abstract ]
Reliable multicast protocols are an important class of protocols for reliably disseminating information from a sender to multiple receivers in the face of node and link failures. A Tree-based Reliable Multicast Protocol (TRAM) provides scalable reliable multicast by grouping receivers into hierarchical repair groups and using a selective acknowledgment mechanism. In this paper, we present an improvement to TRAM to minimize the resource utilization at intermediate hosts and to localize the effect of slow or malicious receivers on normal receivers. We present an evaluation of the existing protocol on a campus-wide wide area network (WAN) with respect to end-to-end parameters (latency, jitter, data rate) and resource utilization (buffer utilization and memory usage at sender and intermediate hosts). A high-quality streaming MPEG video application is used as the workload for the study. The evaluation brings out the effect of scaling the number of receivers on the parameters. The study is done for the error-free case and with message delay, drop, and reordering errors. Then we present the design of an augmented protocol, called TRAM++, that optimizes the buffer requirement at intermediate hosts and improves the end-to-end parameters for normal receivers in the presence of some slow or malicious receivers. An implementation of TRAM++ is provided and evaluated on the campus-wide WAN with and without errors. The evaluation brings out that, given a constraint on the buffer availability at intermediate hosts of 16% of the maximum buffer usage in TRAM, TRAM++ can tolerate the constraint at the expense of increasing the end-to-end latency for the normal receivers by only 3.2% in error-free cases. When slow or faulty receivers are present, TRAM++ is able to provide the same uninterrupted quality of service to the normal nodes while localizing the effect of the faulty ones, without incurring any additional memory overhead.
-
" Light-Weight Randomized Reliable Multicasting Protocol", Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi. Appeared in Fast Abstracts, DSN 2003. [ abstract ]
Multicasting is an efficient way of distributing data from a sender to multiple receivers. There has been considerable interest in augmenting the best-effort nature of IP multicast protocols to support reliable multicast capable of tolerating node crashes and message losses. The basic tree-based protocols (RMTP, TRAM) suffer from the problem that, under failures, a local designated recovery host may get overloaded, and costly remote recovery may be performed even if a host in the local region has the message being requested. A protocol proposed to solve these problems is the Randomized Reliable Multicast Protocol (RRMP). Buffer management techniques proposed for RRMP [2] involve a trade-off between lost message recovery latency and the amount of buffer space used. In this paper, we propose a protocol called Light-weight Randomized Reliable Multicast (LRRM) which uses an alternative lost-message recovery strategy that leads to a smaller buffer requirement without compromising the recovery latency. As reliable multicast protocols are deployed over wide area networks, it is a likely scenario that the intermediate nodes are light-weight and constrained in their buffer space and processing capabilities. Also, the receivers may have widely varying reception rates and periods of disconnection, resulting in large buffer space requirements. This motivates LRRM.
|
|
Dependable Cyber-Physical Systems |
-
" Optimizing Defensive Investments in Energy-Based Cyber-Physical Systems,
Paul Wood, Saurabh Bagchi, and Alefiya Hussain. Accepted to appear at the Dependable Parallel, Distributed and Network-Centric Systems (DPDNS) Workshop, to be held with 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp. 1-10, May 25-29, 2015, Hyderabad, India.
[Presentation ] [ Abstract ]
Interdependent cyber-physical systems (CPS) connect physical resources between independently motivated actors who seek to maximize profits while providing physical services to consumers. Cyberattacks in seemingly distant parts of these systems have local consequences, and techniques are needed to analyze and optimize defensive costs in the face of increasing cyber threats. This paper presents a technique for transforming physical interconnections between independent actors into a dependency analysis that can be applied to find optimal defensive investment strategies to protect assets from financially motivated adversaries in electric power grids.
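As a deliberately simplified illustration of budgeted defensive investment, the Python sketch below greedily buys the defense with the best protected-value-per-cost ratio until the budget is exhausted; the defenses, assets, and numbers are hypothetical, and the greedy heuristic merely stands in for the paper's optimization technique.

    # Greedy budgeted defense selection over a made-up power-grid example.
    def plan_defenses(defenses, budget):
        """defenses: list of (name, cost, {asset: protected value})."""
        protected, chosen, spent = set(), [], 0.0
        while True:
            best, best_ratio = None, 0.0
            for name, cost, assets in defenses:
                if name in (c for c, _ in chosen) or spent + cost > budget:
                    continue
                # Marginal value: only assets not already protected count.
                gain = sum(v for a, v in assets.items() if a not in protected)
                if cost > 0 and gain / cost > best_ratio:
                    best, best_ratio = (name, cost, assets), gain / cost
            if best is None:
                return chosen
            name, cost, assets = best
            chosen.append((name, cost))
            spent += cost
            protected |= set(assets)

    print(plan_defenses(
        [("harden-scada", 3.0, {"substation-1": 5.0, "feeder-2": 2.0}),
         ("segment-net", 2.0, {"feeder-2": 2.0}),
         ("train-staff", 1.0, {"billing": 0.5})], budget=4.0))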
-
" TARDIS: Software-Only System-Level Record and Replay in Wireless Sensor Networks,
"Matthew Tancreti, Vinaitheerthan Sundaram, Saurabh Bagchi, and Patrick Eugster. Accepted to appear at the 14th ACM/IEEE Conference on Information
Processing in Sensor Networks (IPSN), pp. 1-12, April 13-17, 2015, Seattle, WA. (Acceptance rate: 27/111 = 24.3%)
[Presentation ] [ Abstract ]
Wireless sensor networks (WSNs) are plagued by the possibility of bugs manifesting only at deployment. However, debugging deployed WSNs is challenging for several reasons: the remote location of deployed sensor nodes, the non-determinism of execution that can make it difficult to replicate a buggy run, and the limited hardware resources available on a node. In particular, existing solutions to record and replay debugging in WSNs fail to capture the complete code execution, thus negating the possibility of a faithful replay and causing a large class of bugs to go unnoticed. In short, record and replay logs a trace of predefined events while a deployed application is executing, enabling replaying of events later using debugging tools. Existing recording methods fail due to the many sources of non-determinism and the scarcity of resources on nodes.
In this paper we introduce Trace And Replay Debugging In Sensornets (Tardis), a software-only approach for deterministic record and replay of WSN nodes. Tardis is able to record all sources of non-determinism, based on the observation that such information is compressible using a combination of techniques specialized for respective sources. Despite their domain-specific nature, the techniques presented are applicable to the broader class of resource-constrained embedded systems. We empirically demonstrate the viability of our approach and its effectiveness in diagnosing a newly discovered bug in a widely used routing protocol.
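To see why sources of non-determinism compress well, consider this toy encoder (not TARDIS's actual encoding): timer interrupts arrive at near-regular intervals, so storing one initial timestamp plus run-length-encoded deltas reconstructs the log exactly at a fraction of the size.

    # Delta + run-length encoding of an interrupt timestamp log.
    def compress_timestamps(timestamps):
        deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
        runs, prev, count = [], None, 0
        for d in deltas:
            if d == prev:
                count += 1
            else:
                if prev is not None:
                    runs.append((prev, count))
                prev, count = d, 1
        if prev is not None:
            runs.append((prev, count))
        return timestamps[:1], runs     # first timestamp + (delta, repeat) runs

    def decompress(first, runs):
        out = list(first)
        for delta, count in runs:
            for _ in range(count):
                out.append(out[-1] + delta)
        return out

    ts = [0, 10, 20, 30, 41, 51, 61]    # one jittered interval at t=30->41
    first, runs = compress_timestamps(ts)
    assert decompress(first, runs) == ts
    print(runs)                         # [(10, 3), (11, 1), (10, 2)]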
-
" Using Big Data for Improving Dependability: A Cellular Network Tale," Nawanol Theera-Ampornpunt, Saurabh Bagchi; Kaustubh Joshi, Rajesh Panta (AT&T Labs). At the 9th Workshop on Hot Topics in Dependable Systems (HotDep), held in conjunction with the 24th ACM Symposium on Operating Systems Principles (SOSP), pp. 1-6, November 3, 2013, Nemacolin Woodlands Resort, PA. (Acceptance rate: 11/21 = 52.3%) [ Abstract ]
There are many large infrastructures that instrument everything from network performance metrics to user activities. However, the collected data are generally used for long-term planning instead of improving reliability and user experience in real time. In this paper, we present our vision of how such collections of data can be used in real time to enhance the dependability of cellular network services. We first discuss mitigation mechanisms that can be used to improve reliability but incur a high cost, which prohibits their use except under certain conditions. We then present two case studies in which analyses of real cellular network traffic data show that we can identify these conditions.
-
" Low-Complexity Secure Protocols to Defend Cyber-Physical Systems Against Network Isolation Attacks," Dong-Hoon Shin, Jinkyu Koo, Lei Yang (Arizona State U.), Xiaojun Lin, Saurabh Bagchi, and Junshan Zhang (Arizona State U.). At the 1st IEEE Conference on Communications and Network Security (CNS), pp. 1-9, Washington DC, October 14-16, 2013. (Acceptance rate: 40/141 = 28.4%) [ Abstract ]
This paper studies the network isolation attack, a devastating type of attack on cyber-physical systems. In this attack, an adversary compromises a set of nodes that enclose a region in order to isolate the region from the rest of the network. Assuming that the compromised nodes wish to avoid detection, we propose a solution to defend against the network isolation attack. Our goal is to achieve the following security guarantee: either a legitimate node can successfully deliver a message to another legitimate node, or the network control center can identify a small set of suspect nodes, which are guaranteed to contain a compromised node. Toward achieving this goal, we develop two protocols: one for secure delivery of messages among nodes and the other for secure collection of messages from nodes at the network control center. We show that our proposed protocols are provably secure, i.e., attain the aforementioned security guarantee. Further, our protocols achieve this guarantee with overhead that is orders of magnitude smaller than existing baseline protocols. Our proposed protocols are thus scalable for large networks.
-
" A Delay-Bounded Event-Monitoring and Adversary-Identification Protocol in Resource-Constrained Sensor Networks," Jinkyu Koo, Dong-Hoon Shin, Xiaojun Lin, and Saurabh Bagchi. Elsevier Ad Hoc Networks, pp. 1820-1835, vol. 11, issue 6, August 2013. [ Abstract ]
Event monitoring is a common application in wireless sensor networks. For event monitoring, a number of sensor nodes are deployed to monitor certain phenomena. When an event is detected, the sensor nodes report it to a base station (BS), where a network operator can take appropriate action based on the event report. In this paper, we are interested in scenarios where the event must be reported within a time bound to the BS, possibly over multiple hops. However, such event reports can be hampered by compromised nodes in the middle that drop, modify, or delay the event report. To defend against such an attack, we propose SEM, a Secure Event Monitoring protocol against arbitrary malicious attacks by Byzantine adversary nodes. SEM provides the following provable security guarantees. As long as the compromised nodes want to stay undetected, a legitimate sensor node can report an event to the BS within a bounded time. If the compromised nodes prevent the event from being reported to the BS within the bounded time, the BS can identify a pair of nodes that is guaranteed to contain at least one compromised node. To the best of our knowledge, no prior work in the literature can provide such guarantees. SEM is designed to use the minimum level of asymmetric cryptography during normal operation when there is no attack, and to use cryptographic primitives more liberally when an attack is detected. This design has the advantage that the overall SEM protocol is lightweight in terms of the computational resources and the network traffic required by the cryptographic operations. We also show an operational example of SEM using TOSSIM simulations.
-
" An Optimization Framework for Monitoring Multi-Channel Multi-Radio Wireless Mesh Networks," Dong-Hoon Shin and Saurabh Bagchi. Elsevier Ad Hoc Networks, pp. 926-943, Volume 11, Issue 3, May 2013. [ Abstract ]
This paper studies an optimal monitoring problem for behavior-based detection in multi-channel multi-radio wireless mesh networks. In behavior-based detection, nodes overhear communications in their neighborhood to determine if the behaviors of their neighbors are legitimate. The objective of this work is to maximize the number of nodes being monitored by judiciously choosing a set of monitoring nodes and also channels for the chosen monitoring nodes. This problem is NP-hard, with the solution space growing exponentially with the number of monitoring nodes. We develop three approximation algorithms, each of which achieves at least a constant factor of the optimum. Furthermore, one of our algorithms achieves the best possible approximation ratio among all polynomial-time algorithms, unless P = NP. We conduct simulations in random networks and scale-free networks to evaluate the coverage and the execution time of the three algorithms.
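The greedy template that underlies such coverage-maximization approximations is easy to state; the Python sketch below repeatedly picks the (monitor, channel) pair that covers the most not-yet-monitored neighbors. It assumes one radio per monitor and a made-up topology, and it is only the textbook heuristic, not the paper's algorithms.

    # Greedy maximum-coverage heuristic for monitor/channel selection.
    def greedy_monitor_assignment(coverage, k):
        """coverage: {(node, channel): set of neighbors overheard};
        k: number of monitoring nodes to place."""
        covered, assignment, used_nodes = set(), {}, set()
        for _ in range(k):
            best, best_gain = None, 0
            for (node, ch), nbrs in coverage.items():
                if node in used_nodes:
                    continue                  # one channel per monitor here
                gain = len(nbrs - covered)    # marginal coverage of this pick
                if gain > best_gain:
                    best, best_gain = (node, ch), gain
            if best is None:
                break
            assignment[best[0]] = best[1]
            used_nodes.add(best[0])
            covered |= coverage[best]
        return assignment, covered

    cov = {("m1", 1): {"a", "b"}, ("m1", 2): {"c"},
           ("m2", 1): {"b", "c"}, ("m2", 2): {"d", "e"}}
    print(greedy_monitor_assignment(cov, k=2))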
-
" Lilliput meets Brobdingnagian: Data Center Systems Management through
Mobile Devices," Saurabh Bagchi, Fahad Arshad, Jan Rellermeyer (IBM Research), Thomas Osiecki (IBM Research), Michael Kistler (IBM Research), and Ahmed Gheith (IBM Research). At the 3rd International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV 2013), held with the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-6, June 24, 2013, Budapest, Hungary. [ Presentation ] [ Abstract ]
In this paper, we put forward the notion that systems management for large masses of virtual machines in data centers is going to be done differently in the short to medium term future: through smart phones and through controlled crowdsourcing to a variety of experts within an organization, rather than through dedicated system administrators alone. We lay out the research and practitioner challenges this model raises and give some preliminary solution directions that are being developed, here at IBM and elsewhere.
-
" Toward Optimal Sniffer-Channel Assignment for Reliable Monitoring in Multi-Channel Wireless Networks," Dong-Hoon Shin, Saurabh Bagchi, and Chih-Chun Wang. At the 10th IEEE International Conference on Sensing, Communication, and Networking (SECON), pp. 1-9, New Orleans, LA, June 24-27, 2013. (Acceptance rate: 51/173 = 29.5%) [ Presentation ] [ Abstract ]
This paper studies the optimal sniffer-channel assignment for reliable monitoring in multi-channel wireless networks. This problem concerns how to deploy certain sniffers in a network (and tune their channels) so that they can overhear and verify communication among the other nodes, referred to as normal nodes. Prior works have studied the optimal sniffer-channel assignment, but they assume perfect sniffers. However, in practice, sniffers may probabilistically make errors in monitoring, e.g., due to poor reception or compromise by an adversary. Hence, to maintain acceptable monitoring quality, a node needs to be overheard by multiple sniffers. We show that the optimal sniffer-channel assignment with sniffer redundancy differs fundamentally from the previous works due to the absence of a desirable property called submodularity. As a result, in our problem, the prior approximation algorithms no longer maintain their performance guarantees. We propose a variety of approximation algorithms based on two approaches: a greedy strategy and a relaxation-and-rounding approach. We present an empirical performance analysis of the proposed algorithms through simulations in practical networks. Our results suggest that our two algorithms exhibit a performance trade-off between coverage and running time and are therefore suitable for different kinds of deployment.
-
" Mitigating the Effects of Software Component Shifts for Incremental Reprogramming of Wireless Sensor Networks," Rajesh Krishna Panta and Saurabh Bagchi. IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1882-1894, vol. 23, issue 10, October 2012. [ Abstract ]
Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks because software functionality needs to be changed over time. The amount of information that needs to be wirelessly transmitted during reprogramming should be minimized to reduce reprogramming time and energy. In this paper, we present a multihop incremental reprogramming system called Hermes that transfers over the network the delta between the old and new software and lets the sensor nodes rebuild the new software using the received delta and the old software. It reduces the delta by using techniques to mitigate the effects of function and global variable shifts caused by the software modifications. Then it compares the binary images at the byte level with a method to create a small delta that needs to be sent over the wireless network to all the nodes. For the wide range of software change scenarios that we experimented with, we find that Hermes transfers up to 201 times less information than Deluge, the standard reprogramming system for TinyOS, and 64 times less than an existing incremental reprogramming system by Jeong and Culler.
" Multi-Armed Bandit Congestion Control
in Multi-Hop Infrastructure Wireless Mesh Networks,” A. B. M. Alim Al Islam, S. M. Iftekharul Alam, Vijay
Raghunathan, and Saurabh Bagchi. At the 20th IEEE International Symposium on Modeling, Analysis and
Simulation of Computer and Telecommunication Systems (MASCOTS), pp.
1-10, August 7-9, 2012, Arlington, VA. (Acceptance rate: 49/136 = 36%,
our paper was one of 7 accepted papers selected for 10 page publication
limit, the others have an 8 page limit) [ Abstract ]
Congestion control in multi-hop infrastructure wireless mesh networks is both an important and a unique problem. It is unique because it has two prominent causes of failed transmissions that are difficult to tease apart: the lossy nature of the wireless medium and the high extent of congestion around gateways in the network. The concurrent presence of these two causes limits the applicability of available congestion control mechanisms proposed for wireless networks. Prior mechanisms mainly focus on the former cause, ignoring the latter. Therefore, in this paper we design an end-to-end congestion control mechanism for infrastructure wireless mesh networks that addresses both. We formulate the congestion control problem and map it to the restless multi-armed bandit problem, a well-known decision problem in the literature. Then, we propose three myopic policies to achieve a near-optimal solution for the mapped problem, since no optimal solution is known for it. We perform comparative evaluation through ns-2 simulation and a real testbed experiment against a wireline TCP variant and a wireless TCP protocol. The evaluation reveals that our proposed mechanism can achieve up to 52% higher network throughput and 34% lower average energy consumption per transmitted bit compared to the other end-to-end congestion control variants.
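The bandit framing can be pictured with a much simpler stationary sketch: treat each candidate sending rate as an arm, observed goodput as the reward, and run an epsilon-greedy myopic policy. This Python toy (the channel model and all constants are invented) conveys only the framing, not the paper's restless-bandit policies.

    # Epsilon-greedy bandit over candidate sending rates.
    import random

    def bandit_rate_control(rates, observe_goodput, rounds=200, eps=0.1):
        counts = {r: 0 for r in rates}
        means = {r: 0.0 for r in rates}
        for _ in range(rounds):
            if random.random() < eps:
                r = random.choice(rates)                 # explore
            else:
                r = max(rates, key=lambda x: means[x])   # exploit
            reward = observe_goodput(r)
            counts[r] += 1
            means[r] += (reward - means[r]) / counts[r]  # running average
        return max(rates, key=lambda x: means[x])

    # Hypothetical channel: goodput rises with rate until congestion kicks in.
    def fake_goodput(rate):
        loss = max(0.0, (rate - 60) / 100)               # congestion past 60 pkt/s
        return rate * (1 - loss) * random.uniform(0.9, 1.1)

    print(bandit_rate_control([20, 40, 60, 80, 100], fake_goodput))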
-
" An Empirical Study of the Robustness of Inter-component Communication in Android," Amiya K. Maji, Fahad A. Arshad, Saurabh Bagchi and Jan S. Rellermeyer (IBM Research, Austin) In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 12 pages, Boston, MA, June 25-28, 2012 [ Presentation ] [ Abstract ]
Over the last three years, Android has established itself as the largest-selling operating system for smartphones. It boasts of a Linux-based robust kernel, a modular framework with multiple components in each application, and a security-conscious design where each application is isolated in its own virtual machine. However, all of these desirable properties would be rendered ineffectual if an application were to deliver erroneous messages to targeted applications and thus cause the target to behave incorrectly. In this paper, we present an empirical evaluation of the robustness of Inter-component Communication (ICC) in Android through a fuzz testing methodology, whereby parameters of the inter-component communication are changed to various incorrect values. We show not only that exception handling is a rarity in Android applications, but also that it is possible to crash the Android runtime from unprivileged user processes. Based on our observations, we highlight some of the critical design issues in Android ICC and suggest solutions to alleviate these problems.
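The methodology can be pictured with a schematic fuzzer: take a well-formed message, mutate one field at a time to boundary or wrong-typed values, and record which mutations the receiver fails to handle. The field names and the deliver() callback below are placeholders, not Android APIs.

    # One-field-at-a-time mutation fuzzing of an inter-component message.
    import copy

    MUTATIONS = [None, "", 0, -1, 2**31 - 1, "A" * 4096, "not-a-uri"]

    def fuzz_icc(template_msg, deliver):
        """template_msg: dict of ICC fields; deliver raises if the target crashes."""
        failures = []
        for field in template_msg:
            for bad in MUTATIONS:
                msg = copy.deepcopy(template_msg)
                msg[field] = bad
                try:
                    deliver(msg)
                except Exception as exc:      # a crash the receiver didn't handle
                    failures.append((field, bad, type(exc).__name__))
        return failures

    def toy_receiver(msg):                    # stands in for a target component
        msg["action"].startswith("android.intent")   # no type check -> crashes

    print(fuzz_icc({"action": "android.intent.action.VIEW",
                    "data": "http://example.com", "extras": {}}, toy_receiver))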
-
" Aveksha: A Hardware-Software Approach for Non-intrusive Tracing and Profiling of Wireless Embedded Systems,": Matthew Tancreti, Mohammad Hossain, Saurabh Bagchi, and Vijay Raghunathan. In: 9th ACM Conference on Embedded Networked Sensor Systems (SenSys), 14 pages, Seattle, WA, Nov 1-4, 2011. (Winner of the best paper award) (Acceptance rate: 24/123 = 19.5%) [ Presentation ] [ Abstract ]
It is important to get an idea of the events occurring in an embedded wireless node when it is deployed in the field, away from the convenience of an interactive debugger. Such visibility can be useful for post-deployment testing, replay-based debugging, and for performance and energy profiling of various software components. Prior software-based solutions to address this problem have incurred high execution overhead and intrusiveness. The intrusiveness changes the intrinsic timing behavior of the application, thereby reducing the fidelity of the collected profile. Prior hardware-based solutions have involved the use of dedicated ASICs or other tightly coupled changes to the embedded node's processor, which significantly limits their applicability. In this paper, we present AVEKSHA, a hardware-software approach for achieving the above goals in a non-intrusive manner. Our approach is based on the key insight that most embedded processors have an on-chip debug module (which has traditionally been used for interactive debugging) that provides significant visibility into the internal state of the processor. We design a debug board that interfaces with the on-chip debug module of an embedded node's processor through the JTAG port and provides three modes of event logging and tracing: breakpoint, watchpoint, and program counter polling. Using expressive triggers that the on-chip debug module supports, AVEKSHA can watch for, and record, a variety of programmable events of interest. A key feature of AVEKSHA is that the target processor does not have to be stopped during event logging (in the last two of the three modes), subject to a limit on the rate at which logged events occur. AVEKSHA also performs power monitoring of the embedded wireless node and, importantly, enables power consumption data to be correlated to events of interest. AVEKSHA is an operating system-agnostic solution. We demonstrate its functionality and performance using three applications running on TelosB motes: two in TinyOS and one in Contiki. We show that AVEKSHA can trace tasks and other generic events and can perform energy monitoring at function and task-level granularity. We also describe how we used AVEKSHA to find a subtle bug in the TinyOS low power listening protocol.
-
" Distributed Online Channel Assignment Toward Optimal Monitoring in Multi-Channel Wireless NetworksDong-Hoon Shin, Saurabh Bagchi, and Chih-Chun Wang, At the 31st Annual IEEE International Conference on Computer Communications (INFOCOM) Mini-conference, pp. 2918-2912, 2012. [ Presentation ] [ Abstract ]
This paper studies an optimal channel assignment problem for passive monitoring in multi-channel wireless networks, where a set of sniffers capture and analyze the network traffic to monitor the network. The objective of this problem is to maximize the total amount of traffic captured by sniffers by judiciously assigning the radios of sniffers to a set of channels. This problem is NP-hard, with the computational complexity growing exponentially with the number of sniffers. We develop distributed online solutions to this problem for large-scale and dynamic networks. Prior works have attained a constant factor 1 - 1/e of the maximum monitoring coverage in a centralized setting. Our algorithm preserves the same ratio while providing a distributed solution that is amenable to online implementation. Also, our algorithm is cost-effective, in terms of communication and computational overheads, due to the use of only local communication and the adaptation to incremental network changes. We present two operational modes of our algorithm for two types of networks that have different rates of network changes. One is a proactive mode for fast varying networks, while the other is a reactive mode for slowly varying networks. Simulation results demonstrate the effectiveness of the two modes of our algorithm.
-
" Efficient Incremental Code Update for Sensor Networks," Rajesh Krishna Panta, Saurabh Bagchi, and Samuel P. Midkiff. ACM Transactions on Sensor Networks, pp. 1-32, Vol. 7, No. 4, February 2011. [ Abstract ]
Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks since software functionality needs to be changed over time. During reprogramming, the number of radio transmissions should be minimized, since reprogramming time and energy depend chiefly on the number of radio transmissions. In this article, we present a multihop incremental reprogramming protocol called Zephyr that transfers the delta between old and new software versions, and lets the sensor nodes rebuild the new software using the received delta and the old software. Zephyr reduces the delta size by using application-level modifications to mitigate the effects of function shifts. Then it compares the two binary images at the byte level to generate a small delta, which is then sent over the wireless network to all the nodes. For the wide range of software change cases that we used as benchmarks, Zephyr transfers 1.83 to 1987 times less traffic through the network than Deluge (the standard nonincremental reprogramming protocol for TinyOS) and 1.14 to 49 times less traffic than an existing incremental reprogramming protocol by Jeong and Culler [2004].
-
" Dependence-based Multi-level Tracing and Replay for Wireless Sensor Networks Debugging," Man Wang, Zhiyuan Li, Feng Li, Xiaobing Feng, Saurabh Bagchi, and Yung-Hsiang Lu. At the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), 10 pages, April 12-14, 2011. (Acceptance rate: 17/48 = 35.4%) [ Abstract ]
Due to resource constraints and unreliable communication, wireless sensor network (WSN) programming and debugging remain challenging tasks. Deterministic replay is an error diagnosis method that has long been proposed for distributed systems. However, one of the significant hurdles to applying deterministic replay in WSNs is posed by the small program memory on typical sensor nodes. This paper proposes a dependence-based multi-level method for memory-efficient tracing and replay. In the interest of portability across different hardware platforms, the method is implemented as a source-level tracing and replaying tool. To further reduce the code size after tracing instrumentation, a cost model is used for making the decision on which functions to in-line. A prototype of the tool, targeting C programs, is developed on top of the Open64 compiler and is tested using several TinyOS applications running on TelosB motes. Preliminary experimental results show that the test programs, which do not fit the program memory after straightforward instrumentation, can be successfully instrumented using the new method such that the injected errors can be found.
-
" Secure neighbor discovery through overhearing in static multihop wireless networks," Srikanth Hariharan, Ness B. Shroff, and Saurabh Bagchi. In Elsevier Computer Networks, pp. 1-13, in Preprint, doi:10.1016/j.comnet.2010.10.021. [ Abstract ]
In wireless ad-hoc and sensor networks, neighbor discovery is one of the first steps performed by a node upon deployment, and disrupting it adversely affects a number of routing, MAC, topology discovery, and intrusion detection protocols. It is especially harmful when an adversary can convince nodes that it is a legitimate neighbor, which it can do easily and without the use of cryptographic primitives. In this paper, we develop a secure neighbor discovery protocol, SEDINE, for static multihop wireless networks. We prove that, in the absence of packet losses, without using any centralized trusted node or specialized hardware, SEDINE prevents any node, legitimate or malicious, from being incorrectly added to the neighbor list of another legitimate node that is not within its transmission range. We provide simulation results to demonstrate the efficacy of SEDINE in the presence of packet losses.
-
" Stealthy Attacks in Wireless Ad Hoc Networks: Detection and Countermeasure," Issa Khalil and Saurabh Bagchi. Accepted for publication in IEEE Transactions on Mobile Computing, pp. 1-15, in Preprint, http://doi.ieeecomputersociety.org/10.1109/TMC.2010.249. [ Abstract ]
Stealthy packet dropping is a suite of four attacks --- misrouting, power control, identity delegation, and colluding collision --- that can be easily launched against multihop wireless ad hoc networks. Stealthy packet dropping disrupts the packet from reaching the destination through malicious behavior at an intermediate node. However, the malicious node gives the impression to its neighbors that it performs the legitimate forwarding action. Moreover, a legitimate node comes under suspicion. A popular method for detecting attacks in wireless networks is behavior-based detection performed by normal network nodes through overhearing the communication in their neighborhood. This leverages the open broadcast nature of wireless communication. An instantiation of this technology is local monitoring. We show that local monitoring, and the wider class of overhearing-based detection, cannot detect stealthy packet dropping attacks. Additionally, it mistakenly detects and isolates a legitimate node. We present a protocol called SADEC that can detect and isolate stealthy packet dropping attacks efficiently. SADEC presents two techniques that can be overlaid on basic local monitoring: having the neighbors maintain additional information about the routing path, and adding some checking responsibility to each neighbor. Additionally, SADEC provides an innovative mechanism to better utilize local monitoring by considerably increasing the number of nodes in a neighborhood that can do monitoring. We show through analysis and simulation experiments that basic local monitoring fails to efficiently mitigate most of the presented attacks while SADEC successfully mitigates them.
-
" Fixed Cost Maintenance for Information Dissemination in Wireless Sensor Networks," Rajesh Krishna Panta, Madalina Vintila, and Saurabh Bagchi. At the 29th IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 54-63, October 31-November 3, 2010, New Delhi, India. (Acceptance rate: 21/93 = 22.6%) [ Presentation ] [ Abstract ]
Because of transient wireless link failures, incremental node deployment, and node mobility, existing information dissemination protocols used in wireless ad-hoc and sensor networks cause nodes to periodically broadcast an "advertisement" containing the version of their current data item, even in the "steady state" when no dissemination is being done. This is to ensure that all nodes in the network are up-to-date. This causes a continuous energy expenditure during the steady state, which is by far the dominant part of a network's lifetime. In this paper, we present a protocol called Varuna which incurs a constant energy cost, independent of the duration of the steady state. In Varuna, nodes monitor the traffic pattern of the neighboring nodes to decide when an advertisement is necessary. Using testbed experiments and simulations, we show that Varuna achieves several orders of magnitude energy savings compared to Trickle, the existing standard for dissemination in sensor networks, at the expense of a reasonable amount of memory for state maintenance.
-
" RDAS: Reputation-Based Resilient Data Aggregation in Sensor Network": Carlos Perez-Toro, Rajesh Krishna Panta, and Saurabh Bagchi. In the 7th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 9 pages, June 21-25, 2010, Boston, Massachusetts. (Acceptance rate: 63/274 = 23.0%) [ Presentation ] [ Abstract ]
Data aggregation in wireless sensor networks is vulnerable to security attacks and natural failures. A few nodes can drastically alter the result of the aggregation by reporting erroneous data. In this paper we present RDAS, a robust data aggregation protocol that uses a reputation-based approach to identify and isolate malicious nodes in a sensor network. RDAS is based on a hierarchical clustering arrangement of nodes, where a cluster head analyzes data from the cluster nodes to determine the location of an event. It uses the redundancy of multiple nodes sensing an event to determine what data should have been reported by each node. Nodes form part of a distributed reputation system, where they share information about other nodes' performance in reporting accurate data and use the reputation ratings to suppress the reports from malicious nodes. RDAS is able to perform accurate data aggregation in the presence of individually malicious and colluding nodes, as well as nodes that try to compromise the integrity of the reputation system by lying about other nodes' behavior. We show that RDAS is more resilient to security attacks, with respect to accuracy of event localization, than a baseline data aggregation protocol with no security features.
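A toy version of reputation-weighted aggregation, with arbitrary thresholds and update sizes rather than RDAS's actual rules, might look like this: the cluster head takes a robust consensus of the reports, weights the published estimate by reputation, and penalizes reporters that stray far from the consensus.

    # Reputation-weighted aggregation with a median consensus (illustrative).
    import statistics

    def aggregate_round(reports, reputation, tol=2.0):
        """reports: {node: reading}; reputation: {node: weight in [0, 1]}."""
        consensus = statistics.median(reports.values())   # robust to outliers
        w = sum(reputation[n] for n in reports) or 1.0
        estimate = sum(reputation[n] * v for n, v in reports.items()) / w
        for n, v in reports.items():
            delta = 0.05 if abs(v - consensus) <= tol else -0.2
            reputation[n] = min(1.0, max(0.0, reputation[n] + delta))
        return estimate

    rep = {"a": 0.8, "b": 0.8, "c": 0.8}
    for _ in range(5):                        # node c keeps reporting junk
        est = aggregate_round({"a": 20.1, "b": 19.8, "c": 90.0}, rep)
    print(round(est, 1), {n: round(r, 2) for n, r in rep.items()})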
-
" A Tale of Two Synchronizing Clocks": Jinkyu Koo, Rajesh Krishna Panta, Saurabh Bagchi, and Luis Montestruque. In: 7th ACM Conference on Embedded Networked Sensor Systems (SenSys), November 4-6, 2009, Berkeley, California. (Acceptance rate: 21/119 = 17.6%) [ Presentation ] [ abstract ]
A specific application for wastewater monitoring and actuation, called SwimNet, deployed city-wide in a mid-sized US city, posed some challenges to a time synchronization protocol. The nodes in SwimNet have a low duty cycle (2% in the current deployment) and use an external clock, called the Real Time Clock (RTC), for triggering the sleep and the wake-up. The RTC has a very low drift (2 ppm) over the wide range of temperature fluctuations that the SwimNet nodes face, while having a low power consumption (0.66 mW). However, these clocks still have to be synchronized occasionally during the long lifetime of the SwimNet nodes, and this was the problem we confronted with our time synchronization protocol. To fit within the power and cost constraints, the RTC makes the trade-off of having a coarse time granularity of only 1 second. Therefore, it is not sufficient to synchronize the RTC itself; that would mean a synchronization error of up to 1 second would be possible even with a perfect synchronization protocol. This would be unacceptable for the low duty cycle operation, in which each node stays awake for only 6 seconds in a 5 minute time window. This was the first of three challenges for time synchronization. The second challenge is that the synchronization has to be extremely fast, since ideally the entire network should be synchronized during the 6 second wake-up period. Third, the long range radio used for the metropolitan-scale SwimNet does not make its radio stack software available, as is seen with several other radios for long-range ISM band RF communication. Therefore, a common technique for time synchronization, MAC layer time-stamping, cannot be used. Additionally, MAC layer time-stamping is known to be problematic with high speed radios (even at 250 kbps). We solve these challenges and design a synchronization protocol called HARMONIA. It has three design innovations. First, it uses the finely granular microcontroller clock to achieve synchronization of the RTC, such that the synchronization error, despite the coarse granularity of the RTC, is in the microsecond range. Second, HARMONIA pipelines the synchronization messages through the network, resulting in fast synchronization of the entire network. Third, HARMONIA provides failure handling for transient node and link failures, such that the network is not overburdened with synchronization messages and the recovery is done locally. We evaluate HARMONIA on SwimNet nodes and compare the two metrics of synchronization error and synchronization speed with FTSP. It performs slightly worse on the former and significantly better on the latter.
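The first design innovation can be illustrated with a small sketch: combine the coarse RTC seconds with the fine-grained microcontroller timer that timestamped the last RTC tick edge, so two nodes can compare times at microsecond granularity. All numbers and the message exchange below are invented; this is not HARMONIA's actual protocol.

    # Combining a 1 Hz RTC with a fast MCU timer for fine-grained time.
    def local_time(rtc_seconds, mcu_ticks_at_rtc_edge, mcu_now, mcu_hz=4_000_000):
        """Microseconds since epoch, from coarse RTC plus fine MCU timer."""
        frac_us = (mcu_now - mcu_ticks_at_rtc_edge) * 1_000_000 // mcu_hz
        return rtc_seconds * 1_000_000 + frac_us

    # Node B receives A's timestamp in a sync message and computes its offset.
    t_a = local_time(rtc_seconds=1000, mcu_ticks_at_rtc_edge=0, mcu_now=2_000_000)
    t_b = local_time(rtc_seconds=1000, mcu_ticks_at_rtc_edge=0, mcu_now=2_400_000)
    offset_us = t_b - t_a          # B is 100 ms ahead; correct by -offset_us
    print(t_a, t_b, offset_us)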
-
" Multigrade Security Monitoring for Ad-Hoc Wireless Networks": Matthew Tan Creti, Matthew Beaman, Saurabh Bagchi, Zhiyuan Li, and Yung-Hsiang Lu. In: 6th IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS), October 12-15, 2009, Macau SAR, China. (Acceptance rate: 62/245 = 25.3%). [ Presentation ] [ abstract ]
Ad-hoc wireless networks are being deployed in critical applications that require protection against sophisticated adversaries. However, wireless routing protocols, such as the widely-used AODV, are often designed with the assumption that nodes are benign. Cryptographic extensions such as Secure AODV (SAODV) protect against some attacks but are still vulnerable to easily-performed attacks using colluding adversaries, such as the wormhole attack. In this paper, we make two contributions to securing routing protocols. First, we present a protocol called Route Verification (RV) that can detect and isolate malicious nodes involved in routing-based attacks with very high likelihood. However, RV is expensive in terms of energy consumption due to its radio communications. To remedy the high energy cost of RV, we make our second contribution. We propose a multigrade monitoring (MGM) approach. The MGM approach employs a previously developed lightweight local monitoring technique to detect any necessary condition for an attack to succeed. However, local monitoring suffers from false positives due to collisions on the wireless channel. When a necessary condition is detected, the heavy-weight RV protocol is triggered. We show through simulation that MGM applied to AODV generally requires little extra energy compared to baseline AODV, under the common case where there is no attack present. It is also more resource-efficient and powerful than SAODV in detecting attacks. Our work, for the first time, lays out the framework of multigrade monitoring, which we believe fundamentally addresses the tension between security and resource consumption in ad-hoc wireless networks.
-
" UnMask: Utilizing Neighbor Monitoring for Attack Mitigation in Multihop Wireless Sensor Networks": Issa Khalil, Saurabh Bagchi, Cristina N.-Rotaru, and Ness Shroff. In Elsevier Ad Hoc Networks Journal, notification of acceptance: June 2009. [ abstract ] Sensor networks enable a wide range of applications in both military and civilian domains. However, the deployment scenarios, the functionality requirements, and the limited capabilities of these networks expose them to a wide-range of attacks against control traffic (such as wormholes, rushing, Sybil attacks, etc) and data traffic (such as selective forwarding). In this paper we propose a framework called UNMASK that mitigates such attacks by detecting, diagnosing, and isolating the malicious nodes. UNMASK uses as a fundamental building block the ability of a node to oversee its neighboring nodes? communication. On top of UNMASK, we build a secure routing protocol, LSR, that provides additional protection against malicious nodes by supporting multiple node-disjoint paths. We analyze the security guarantees of UNMASK and use ns-2 simulations to show its effectiveness against representative control and data attacks. The overhead analysis we present shows that UNMASK is a lightweight protocol appropriate for securing resource constrained sensor networks.
-
" Zephyr: Efficient Incremental Reprogramming of Sensor Nodes using Function Call Indirections and Difference Computation": Rajesh Krishna Panta, Saurabh Bagchi, and Samuel P. Midkiff. In: USENIX Annual Technical Conference (USENIX '09), June 14-19, 2009, pp. 411-424, San Diego, California. (Acceptance rate: 32/191 = 16.8%). [ Presentation ] [ abstract ]
Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks since the software functionality changes over time. The amount of information that needs to be wirelessly transmitted during reprogramming should be minimized since reprogramming time and energy depend chiefly on the amount of radio transmissions. In this paper, we present a multihop incremental reprogramming protocol called Zephyr that transfers the delta between the old and the new software and lets the sensor nodes rebuild the new software using the received delta and the old software. It reduces the delta size by using application-level modifications to mitigate the effects of function shifts. Then it compares the binary images at the byte level with a novel method to create a small delta, which is then sent over the wireless network to all the nodes. For the wide range of software-change cases that we experimented with, we find that Zephyr transfers 1.83 to 1987 times less traffic through the network than Deluge, the standard reprogramming protocol for TinyOS, and 1.14 to 49 times less than an existing incremental reprogramming protocol by Jeong and Culler.
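The function-call-indirection idea behind Zephyr's title can be seen in miniature below: if call sites jump through fixed table slots instead of embedding real addresses, a code change that shifts functions alters only the small table, not every call site. The images and addresses are invented for illustration.

    # Why indirection shrinks deltas: direct calls vs. a slot table.
    def image_with_direct_calls(addrs):
        # every call site embeds the callee's real address
        return [("call", addrs[f]) for f in ("send", "recv", "log") * 10]

    def image_with_indirection(addrs, slots=("send", "recv", "log")):
        table = [addrs[f] for f in slots]                 # one entry per function
        body = [("call_slot", slots.index(f)) for f in ("send", "recv", "log") * 10]
        return table, body

    old = {"send": 0x1000, "recv": 0x1200, "log": 0x1400}
    new = {"send": 0x1000, "recv": 0x1250, "log": 0x1450}  # recv grew; log shifted

    direct_diff = sum(a != b for a, b in
                      zip(image_with_direct_calls(old), image_with_direct_calls(new)))
    t_old, b_old = image_with_indirection(old)
    t_new, b_new = image_with_indirection(new)
    indirect_diff = sum(a != b for a, b in zip(t_old + b_old, t_new + b_new))
    print(direct_diff, indirect_diff)   # 20 changed call sites vs. 2 table entries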
-
" Optimal Monitoring in Multi-Channel Multi-Radio Wireless Mesh Networks": Dong-Hoon Shin and Saurabh Bagchi. In: 10th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc 2009), pp. 229-238, May 18-21, 2009, New Orleans, LA. (Acceptance rate: 31/175 = 17.7%) [ Presentation ] [ abstract ]
Wireless mesh networks (WMN) are finding increasing usage in city-wide deployments for providing network connectivity. Mesh routers in WMNs typically use multiple wireless channels to enhance the spatial reuse of frequency bands, often with multiple radios per node. Due to the cooperative nature of WMNs, they are susceptible to many attacks that cannot be defeated by using traditional cryptographic mechanisms of authentication or encryption alone. A solution approach commonly used for defending against such attacks is behavior-based detection, in which some nodes overhear communication in their neighborhood to determine if the behavior of a neighbor is legitimate. It has been proposed to use specialized monitoring nodes deployed strategically throughout the network for performing such detection. The problem that arises is where to deploy these monitoring nodes, how to minimize their number, and which channels to tune their radios to, such that the maximum part of the network can be covered. This problem has been solved for single channel networks by a greedy approximation algorithm, since the exact solution is NP-hard; the greedy algorithm achieves the best worst-case performance possible among all polynomial-time algorithms, provided that P != NP. In this paper, we solve the problem for multi-channel multi-radio WMNs. The intuitive extension of the greedy algorithm destroys the property of best performance. Instead, we formulate the problem as an integer linear program, solve its linear program relaxation, and then use two rounding techniques that we develop by adapting existing rounding schemes. We thereby present two approximation algorithms. The first, a computationally light algorithm called the probabilistic rounding algorithm, gives an expected best performance in the worst case. The second, called the deterministic rounding algorithm, achieves the best worst-case performance in a deterministic manner. To evaluate how the three algorithms perform in practice, we simulate them in random networks and scale-free networks.
-
" Hermes: Fast and Energy Efficient Incremental Code Updates for Wireless Sensor Networks": Rajesh Krishna Panta, Saurabh Bagchi. In: 28th Annual IEEE Conference on Computer Communications (INFOCOM), pp. 639-647, April 19-25 2009, Rio de Janeiro, Brazil. (Acceptance rate: 282/1435 = 19.7%)
[ Presentation ] [ abstract ]
Wireless reprogramming of sensor nodes is a requirement for long-lived networks due to changes in the functionality of the software running on the nodes. The amount of information that needs to be wirelessly transmitted during reprogramming should be minimized to reduce reprogramming time and energy. In this paper, we present a multi-hop incremental reprogramming protocol called Hermes that transfers the delta between the old and new software and lets the sensor nodes rebuild the new software using the received delta and the old software. It reduces the delta by using techniques to mitigate the effects of function and global variable shifts caused by the software modifications. Then it compares the binary images at the byte level with a method to create a small delta that needs to be sent over the wireless network to all the nodes. For a wide range of software change scenarios that we experimented with, we find that Hermes transfers up to 201 times less information than Deluge, the standard reprogramming protocol for TinyOS, and 64 times less than an existing incremental reprogramming protocol by Jeong and Culler.
-
" Efficient Wireless Reprogramming through Reduced Bandwidth Usage and Opportunistic Sleeping": Rajesh Krishna Panta, Saurabh Bagchi and Issa M. Khalil. In Elsevier Ad Hoc Networks Journal, Volume 7, Issue 1, pp. 42-62, January 2009.
[ abstract ]
Wireless reprogramming of a sensor network is useful for uploading new code or for changing the functionality of existing code. Reprogramming may be done multiple times during a node's lifetime and therefore a node has to remain receptive to future code updates. Existing reprogramming protocols, including Deluge, achieve this by bundling the reprogramming protocol and the application as one code image which is transferred through the network. The reprogramming protocol being complex, the overall size of the program image that needs to be transferred over the wireless medium increases, thereby increasing the time and energy required for reprogramming a network. We present a protocol called Stream that significantly reduces this bloat by using the facility of having multiple code images on the node. It pre-installs the reprogramming protocol as one image and equips the application program with the ability to listen to new code updates and switch to this image. For a sample application, the increase in size of the application image is 1 page (48 packets of 36 bytes each) for Stream and 11 pages for Deluge. Additionally, we design an opportunistic sleeping scheme whereby nodes can sleep during the period when reprogramming has been initiated but has not yet reached the neighborhood of the node. The savings become significant for large networks and for frequent reprogramming. We implement Stream on Mica2 motes and conduct testbed and simulation experiments to compare delay and energy consumption for different network sizes with respect to the state-of-the-art Deluge protocol.
-
" SeNDORComm: An Energy-Efficient Priority-Driven Communication Layer for
Reliable Wireless Sensor Networks": Vinai Sundaram, Saurabh Bagchi, Yung-Hsiang Lu, and Zhiyuan Li. In: 27th
International Symposium on Reliable Distributed Systems (SRDS), pp. 23-32, Naples,
Italy, October 6-8, 2008. (Acceptance rate: 28/112 = 25%) [ presentation ]
[ abstract ]
In many reliable Wireless Sensor Network (WSN) applications, messages have different priorities depending on urgency or importance. For example, a message reporting the failure of all nodes in a region is more important than that for a single node. Moreover, traffic can be bursty in nature, such as when a correlated error is reported by multiple nodes running identical code. Current communication layers in WSNs lack efficient support for these two requirements. We present a priority-driven communication layer, called SeNDORComm, which schedules transmission of packets driven by application-specified priority, buffers and packs multiple messages in a packet, and honors the latency guarantee for a message. We show that SeNDORComm improves energy efficiency, message reliability, and network utilization and delays congestion in a network. We extensively evaluate SeNDORComm using analysis, simulation, and testbed experiments. We demonstrate the improvement in goodput of SeNDORComm over the default communication layer, GenericComm in TinyOS (134.78% for a network of 20 nodes).
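A minimal sketch of priority-driven packing, with illustrative sizes and margins rather than SeNDORComm's actual policy: queue messages with a priority and a deadline, and transmit one packed radio packet when the payload fills up or the earliest deadline nears.

    # Priority queue + message packing into a single radio payload.
    import heapq, time

    PAYLOAD_BYTES = 100

    class PackingLayer:
        def __init__(self):
            self.q = []                                  # min-heap on -priority
        def send(self, priority, deadline, payload: bytes):
            heapq.heappush(self.q, (-priority, deadline, payload))
            self._maybe_flush()
        def _maybe_flush(self):
            queued = sum(len(p) for _, _, p in self.q)
            nearest = min((d for _, d, _ in self.q), default=float("inf"))
            if queued >= PAYLOAD_BYTES or nearest - time.time() < 2.0:
                self._flush()
        def _flush(self):
            packet, used = [], 0
            while self.q and used + len(self.q[0][2]) <= PAYLOAD_BYTES:
                _, _, payload = heapq.heappop(self.q)    # highest priority first
                packet.append(payload)
                used += len(payload)
            print("radio packet:", packet)               # one transmission

    layer = PackingLayer()
    layer.send(1, time.time() + 60, b"temp=21")          # low priority, can wait
    layer.send(9, time.time() + 1, b"NODE-FAILURE")      # urgent: forces a flush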
-
" MISPAR: Mitigating Stealthy Packet Dropping in Locally-Monitored Multi-hop Wireless Ad Hoc
Networks": Issa Khalil and Saurabh Bagchi. In: 4th International Conference on Security and Privacy in Communication Networks (SecureComm), 10 pages, Istanbul, Turkey, September 22-25, 2008. (Acceptance rate: 26/124 = 21%) [ abstract ]
Local monitoring has been demonstrated as a powerful technique for mitigating security attacks in multi-hop ad-hoc networks. In local monitoring, nodes overhear partial neighborhood communication to detect misbehavior such as packet drop or delay. However, local monitoring as presented in the literature is vulnerable to a class of attacks that we introduce here called stealthy packet dropping. Stealthy packet dropping disrupts the packet from reaching the destination by malicious behavior at an intermediate node. However, the malicious node gives the impression to its neighbors that it performed the legitimate forwarding action. Moreover, a legitimate node comes under suspicion. We introduce four ways of achieving stealthy packet dropping, none of which is currently detectable. We provide a protocol called MISPAR based on local monitoring to remedy each attack. It presents two techniques: having the neighbors maintain additional information about the routing path, and adding some checking responsibility to each neighbor. We show through analysis and simulation that basic local monitoring fails to mitigate any of the presented attacks while MISPAR successfully mitigates them.
-
" Single versus Multi-hop Wireless Reprogramming in Sensor Networks": Rajesh Krishna Panta, Issa Khalil, Saurabh Bagchi,
Luis Montestruque. In: 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (Tridentcom), 7 pages, Innsbruck, Austria, March 18-20, 2008. [ Presentation ] [ abstract ]
Wireless reprogramming of the sensor network is useful for uploading new code or for changing the functionality of the existing code. In recent years, the research focus has shifted from single-hop reprogramming to multi-hop reprogramming, primarily because of its ease of use. Practical experience from a multi-hop sensor network for monitoring water pollution, called CSOnet, deployed in South Bend, IN, indicates that single-hop reprogramming may be preferable under certain conditions to minimize reprogramming time and energy. In single-hop reprogramming, the user gets close to a node to be reprogrammed and wirelessly reprograms a single node at a time. The choice between single-hop and multi-hop reprogramming depends on factors such as network size, node density, and, most importantly, link reliabilities. We present a protocol called DStream having both single- and multi-hop reprogramming capabilities. We provide mathematical analysis and results from testbed experiments (including experiments conducted on CSOnet networks) and simulations to give insights into the choice of the two reprogramming methods for various network parameters.
-
" Optimizing AES for Embedded Devices and Wireless Sensor Networks": Shammi Didla, Aaron Ault and Saurabh Bagchi. In: 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (Tridentcom), pp. 1-10, Innsbruck, Austria, March 18-20, 2008. [ Presentation ] [ abstract ]
The increased need for security in embedded applications in recent years has prompted efforts to develop encryption/decryption algorithms capable of running on resource-constrained systems. The inclusion of the Advanced Encryption Standard (AES) in the IEEE 802.15.4 Zigbee protocol has driven its widespread use in current embedded platforms. We propose an implementation of AES in a high-level language (C in this case) that is the first software-based solution for 16-bit microcontrollers capable of matching the communication rate of 250 kbps specified by the Zigbee protocol, while also minimizing RAM and ROM usage. We discuss a series of optimizations and their effects that lead to our final implementation achieving an encryption speed of 286 kbps, RAM usage of 260 bytes, and code size of 5160 bytes on the Texas Instruments MSP430 microprocessor. We also develop rigorous benchmark experiments to compare other AES implementations on a common platform, and show that our implementation outperforms the best available implementation by 85%.
-
" Energy-efficient, On-demand Reprogramming of Large-Scale Sensor Networks": Mark D. Krasniewski, Rajesh K. Panta, Saurabh Bagchi, Chin-Lung Yang, and William J. Chappell. In: ACM Transactions on Sensor Networks (TOSN), notification of acceptance: June 2007. [ abstract ]
As sensor networks operate over long periods of deployment in difficult-to-reach places, their requirements may change or new code may need to be uploaded to them. The current state-of-the-art protocols (Deluge and MNP) for network reprogramming perform the code dissemination in a multi-hop manner using a three-way handshake whereby meta-data is exchanged prior to code exchange to suppress redundant transmissions. The code image is also pipelined through the network at the granularity of pages. In this paper we propose a protocol called Freshet for optimizing the energy for code upload and speeding up the dissemination if multiple sources of code are available. The energy optimization is achieved by equipping each node with limited non-local topology information, which it uses to determine the time when it can go to sleep since code is not being distributed in its vicinity. The protocol to handle multiple sources provides a loose coupling of nodes to a source and disseminates code in waves, each originating at a source, with a mechanism to handle collisions when the waves meet. The protocol's performance with respect to reliability, delay, and energy consumed is demonstrated through analysis, simulation, and implementation on the Berkeley mote platform.
-
" Adaptive Correctness Monitoring for Wireless Sensor Networks Using Hierarchical Distributed Run-Time Invariant Checking": Douglas Herbert, Vinaitheerthan Sundaram, Yung-Hsiang Lu, Saurabh Bagchi, and Zhiyuan Li. In ACM Transactions on Autonomous and Adaptive Systems (TAAS), notification of acceptance: May 2007. [ abstract ]
This paper presents a hierarchical approach for detecting faults in wireless sensor networks (WSNs) after they have been deployed. The developers of WSNs can specify "invariants" that must be satisfied by the WSNs. We present a framework, Hierarchical SEnsor Network Debugging (H-SEND), for lightweight checking of invariants. H-SEND is able to detect a large class of faults in data-gathering WSNs and leverages the existing message flow in the network by buffering and piggybacking messages. H-SEND checks as closely to the source of a fault as possible, pinpointing the fault quickly and efficiently in terms of additional network traffic. Therefore, H-SEND is suited to bandwidth- or communication-energy-constrained networks. A specification expression is provided for specifying invariants so that a protocol developer can write behavioral-level invariants.
We hypothesize that data from sensor nodes does not change dramatically, but rather changes gradually over time. We extend our framework to invariants that include values determined at run time in order to detect the violation of data trends. The value range can be based on information local to a single node or on the surrounding nodes' values. Using our system, developers can write invariants to detect data trends without prior knowledge of correct values. Automatic value detection can be used to detect anomalies that could not previously be caught. To demonstrate the benefits of run-time range detection and fault checking, we construct a prototype WSN using CO2 and temperature sensors coupled to Mica2 motes. We show that our method can detect sudden changes in the environment with little overhead in communication, computation, and storage.
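As an aside on how such a run-time range invariant can look in practice, here is a minimal Python sketch (illustrative only; the sliding window, tolerance margin, and class name are our assumptions, not the H-SEND specification language):

from collections import deque

class TrendInvariant:
    # Sliding-window range invariant: flag a reading that jumps more than
    # a tolerance margin beyond the recently observed min/max.
    def __init__(self, window=20, tolerance=0.1):
        self.history = deque(maxlen=window)
        self.tolerance = tolerance

    def check(self, value):
        if len(self.history) >= 5:   # require a minimal history first
            lo, hi = min(self.history), max(self.history)
            margin = self.tolerance * max(abs(lo), abs(hi), 1.0)
            if not (lo - margin <= value <= hi + margin):
                return False         # violation: abrupt change in the trend
        self.history.append(value)
        return True

inv = TrendInvariant()
for v in [400, 402, 401, 405, 403, 404, 950]:   # e.g., CO2 readings (ppm)
    if not inv.check(v):
        print("trend violation:", v)             # fires on the spike to 950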
-
" LITEWORP: Detection and Isolation of the Wormhole Attack in Static Multihop Wireless Networks": Issa Khalil, Suarabh Bagchi, and Ness B. Shroff. In: Elsevier Computer Networks Journal, notification of acceptance: April 2007.
[ abstract ]
In multihop wireless systems, such as ad-hoc and sensor networks, the need for cooperation among nodes to relay each other's packets exposes them to a wide range of security attacks. A particularly devastating attack is known as the wormhole attack, where a malicious node records control and data traffic at one location and tunnels it to a colluding node far away, which replays it locally. This can either disrupt route establishment or make routes pass through the malicious nodes. In this paper, we present a lightweight countermeasure for the wormhole attack, called LITEWORP, which relies on overhearing neighbor communication. LITEWORP is particularly suitable for resource-constrained multihop wireless networks, such as sensor networks. Our solution allows detection of the wormhole, followed by isolation of the malicious nodes. Simulation results show that every wormhole is detected and isolated within a very short period of time over a large range of scenarios. The results also show that the fraction of packets lost due to the wormhole when LITEWORP is applied is negligible compared to the loss in an unprotected network. Simulation results further identify configurations in which no framing of legitimate nodes is possible while a high detection rate is still achieved. Analysis is carried out to show the low resource consumption of LITEWORP, the low detection latency, and the likelihood of framing by malicious nodes.
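To make the overhearing-based building block concrete, here is a minimal Python sketch (illustrative only, not code from the paper; the timeout and strike threshold are arbitrary): a guard node remembers packets it overhears being handed to a neighbor and accuses the neighbor if they are not forwarded in time.

TIMEOUT, MAX_STRIKES = 0.5, 3        # seconds; tolerated misbehaviors

class Guard:
    def __init__(self):
        self.pending = {}            # (neighbor, packet_id) -> time overheard
        self.strikes = {}            # neighbor -> misbehavior count

    def overhear_incoming(self, neighbor, pkt_id, now):
        self.pending[(neighbor, pkt_id)] = now

    def overhear_forwarded(self, neighbor, pkt_id):
        self.pending.pop((neighbor, pkt_id), None)   # neighbor behaved correctly

    def sweep(self, now):
        # Expire unforwarded packets and return neighbors to isolate.
        for key, t in list(self.pending.items()):
            if now - t > TIMEOUT:
                del self.pending[key]
                nbr = key[0]
                self.strikes[nbr] = self.strikes.get(nbr, 0) + 1
        return [n for n, s in self.strikes.items() if s >= MAX_STRIKES]

g = Guard()
for i in range(3):
    g.overhear_incoming("wormhole_end", i, now=float(i))
print(g.sweep(now=10.0))             # ['wormhole_end'] -> isolate locally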
-
" MOBIWORP: Mitigation of the Wormhole Attack in Mobile Multihop Wireless Networks": Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. In: Elsevier's Journal of AdHoc Networks, notification of acceptance: February 2007.
[ abstract ]
In multihop wireless systems, the need for cooperation among nodes to relay each other's packets exposes them to a wide range of security attacks. A particularly devastating attack is the wormhole attack, where a malicious node records control traffic at one location and tunnels it to a colluding node, possibly far away, which replays it locally. This can have an adverse effect on route establishment by preventing nodes from discovering legitimate routes that are more than two hops away. Previous works on tolerating wormhole attacks have focused only on detection and used specialized hardware, such as directional antennas or extremely accurate clocks. More recent work has addressed the problem of locally isolating the malicious nodes. However, all of this work has been done in the context of static networks, due to the difficulty of secure neighbor discovery with mobile nodes. The existing work on secure neighbor discovery has limitations in accuracy, resource requirements, and applicability to ad hoc and sensor networks. In this paper, we present a countermeasure for the wormhole attack, called MOBIWORP, which alleviates these drawbacks and efficiently mitigates the wormhole attack in mobile networks. MOBIWORP uses a secure central authority (CA) for global tracking of node positions. Local monitoring is used to detect and isolate malicious nodes locally. Additionally, when sufficient suspicion builds up at the CA, it enforces a global isolation of the malicious node from the whole network. The effect of MOBIWORP on the data traffic and the fidelity of detection is brought out through extensive simulation using ns-2. The results show that as time progresses, the data packet drop ratio goes to zero with MOBIWORP, due to its capability to detect, diagnose, and isolate malicious nodes. With an appropriate choice of design parameters, MOBIWORP is shown to completely eliminate framing of a legitimate node by malicious nodes, at the cost of a slight increase in the drop ratio. The results also show that increasing mobility of the nodes degrades the performance of MOBIWORP.
-
" SLAM: Sleep-Wake Aware Local Monitoring in Sensor Networks", Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. In: IEEE Symposium on Dependable Systems and Networks (DSN), June 25-28, 2007, Edinburgh, Ireland. [ Presentation ] [ abstract ]
Sleep-wake protocols are critical in sensor networks to ensure long-lived operation. However, an open problem is how to develop efficient mechanisms that can be incorporated with sleep-wake protocols to ensure both long-lived operation and a high degree of security. Our contribution in this paper is to address this problem by using local monitoring, a powerful technique for detecting and mitigating control and data attacks in sensor networks. In local monitoring, each node oversees part of the traffic going in and out of its neighbors to determine if the behavior is suspicious, such as an unusually long delay in forwarding a packet. Here, we present a protocol called SLAM to make local monitoring parsimonious in its energy consumption and to integrate it with any extant sleep-wake protocol in the network. The challenge is to enable sleep-wake in a secure manner even in the face of nodes that may be adversarial and may not wake up the nodes responsible for monitoring their traffic. We prove analytically that the security coverage is not weakened by the protocol. We perform simulations in ns-2 to demonstrate that the performance of local monitoring is practically unchanged, while a listening energy saving of 30 to 129 times is achieved, depending on the network load.
-
" Fault Tolerant ARIMA-based Aggregation of Data in Sensor Networks": Doug Herbert, Gaspar Modelo-Howard, Carlos Perez-Toro, Saurabh Bagchi. In: Fast Abstract in the Supplemental Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2007, Edinburgh, Ireland. [ introduction ]
Sensor networks collect data using robust, energy-efficient, low-bandwidth protocols because of the few resources available to them. Among the many methods explored to improve the efficiency of sensor networks for data collection, researchers have proposed statistical methods to predict when each sensor should send data. Two shortcomings of this approach are: (1) the use of a static model, assuming unchanging environmental parameters for the sensor network, and (2) the lack of an error-detection mechanism to compensate for failures during the transmission of data. Suppressing data transmissions by exploiting the fact that the data follows a statistical model has been proposed before, under the assumption that the environmental parameters do not change. This assumption does not fit several scenarios where sensor networks can be deployed. Additionally, recent advancements in sensor networks allow for more powerful CPUs and memory on sensing nodes. This presents an opportunity to extend the role of end nodes by allowing them to perform statistical analysis on the collected data and to update model parameters in real time. In this manner, we can create a self-adapting model.
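Concretely, here is a minimal Python sketch of the suppression idea (not from the paper; a random-walk predictor, the simplest ARIMA special case, stands in for the richer models discussed above, and the threshold and names are our assumptions):

class ModelSuppressor:
    # Transmit only when a reading deviates from the model prediction
    # by more than a threshold; the base station runs the same model,
    # so suppressed readings are reconstructed as the prediction.
    def __init__(self, threshold=2.0):
        self.threshold = threshold
        self.estimate = None         # value the base station currently assumes

    def observe(self, reading):
        if self.estimate is None or abs(reading - self.estimate) > self.threshold:
            self.estimate = reading  # transmit: base station updates its copy
            return True
        return False                 # suppress: prediction is close enough

node = ModelSuppressor(threshold=2.0)
sent = [r for r in [20.0, 20.3, 21.1, 24.5, 24.7, 25.0] if node.observe(r)]
print(sent)                          # [20.0, 24.5]: only trend breaks are sent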
-
" Stream: Low Overhead Wireless Reprogramming for Sensor Networks", Rajesh Krishna Panta, Issa Khalil, and Saurabh Bagchi. 26th Annual IEEE Conference on Computer Communications (INFOCOM), pp. 928-936, May 6-12 2007, Anchorage, Alaska, USA. (Acceptance rate: 252/~1400 = 18%) [ Presentation ] [ abstract ]
Wireless reprogramming of a sensor network is useful for uploading new code or for changing the functionality of existing code. Throughout this process, a node should remain receptive to future code updates, because reprogramming may be done multiple times during the node's lifetime. Existing reprogramming protocols, such as Deluge, achieve this by bundling the reprogramming protocol and the application into one program image, thereby increasing the overall size of the image that is transferred through the network. This increases both the time and the energy required for network reprogramming. We present a protocol called Stream that mitigates the problem by significantly reducing the size of the program image. Using the facility of having multiple code images on a node and switching between them, Stream pre-installs the reprogramming protocol as one image and the application program, equipped with the ability to listen for new code updates, as the second image. For a sample application, Stream reduces the size of the program image by 10 pages (48 packets/page) compared to Deluge. Stream is implemented on Mica2 sensor nodes, and we conduct testbed and simulation experiments to show the reduction in energy and reprogramming time of Stream compared to Deluge.
-
" Data-Centric Routing in Sensor Networks: Single-hop Broadcast or Multi-hop Unicast", Xuan Zhong, Ravish Khosla, Gunjan Khanna , Saurabh Bagchi and Edward J. Coyle. IEEE 65th Vehicular Technology Conference (VTC2007-Spring), pp. 150-154, April 22 - 25 2007. (Acceptance rate: 685/1443 ~ 47.4%) [ abstract ]
Data dissemination strategies and communication protocols that minimize the use of energy can significantly prolong the lifetime of a sensor network. Data-centric dissemination strategies seek energy efficiency by employing short metadata descriptions in advertisements (ADVs) of the availability of data, short requests (REQs) to obtain the data by nodes that are interested in it, and data transmissions (DATA) to deliver the data to the requesting nodes. An important decision in this process is whether the DATA transmission should be made at full power in broadcast mode or at low power in multi-hop unicast mode. The determining factor is shown in this paper to be the fraction of nodes that are interested in the DATA, as indicated by the number of REQs that are generated. Closed-form expressions for this critical fraction of interested nodes are derived for the cases in which nodes have no memory or infinite memory for state information, and in which transmissions are reliable or unreliable. These results can be used during both the design and the operation of the network to increase energy efficiency and network longevity.
" Analysis and Evaluation of SECOS, a Protocol for Energy Efficient and Secure Communication in Sensor Networks?, Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. Elsevier Ad-Hoc Networks Journal, Volume 5, Issue 3, pp. 360-391, April 2007.
[ abstract ]
Wireless sensor networks are increasingly being used in applications where the communication between nodes needs to be protected from eavesdropping and tampering. Such protection is typically provided using techniques from symmetric key cryptography. The protocols in this domain suffer from one or more of the following problems: weak security guarantees if some nodes are compromised, lack of scalability, high energy overhead for key management, and increased end-to-end data latency. In this paper, we propose a protocol called SECOS that mitigates these problems in static sensor networks. SECOS divides the sensor field into control groups, each with a control node. Data exchange between nodes within a control group happens through the mediation of the control head, which provides the common key. The keys are refreshed periodically and the control nodes are changed periodically to enhance security. SECOS enhances the survivability of the network by handling compromise and failures of control nodes. It provides the guarantee that the communication between any two sensor nodes remains secure despite the compromise of any number of other nodes in the network. Experiments based on a simulation model show a seven-fold reduction in energy overhead and a 50% reduction in latency compared to SPINS, which is one of the state-of-the-art protocols for key management in sensor networks.
" Performance Comparison of SPIN based Push-Pull Protocols", Ravish Khosla, Xuan Zhong, Gunjan Khanna, Saurabh Bagchi,and Edward J. Coyle. IEEE Wireless Communications and Networking Conference (WCNC), Mar 11-15, 2007, Hong Kong. (Acceptance rate: 48%) [ abstract ]
Multiple data-centric protocols - which can broadly be classified as push-pull, push-only, or pull-only - have been proposed in the literature. In this paper we present a framework to develop an insight into the characteristics of push-pull protocols. The performance of push-pull protocols is critically dependent on the time-out settings used to trigger failure recovery mechanisms. We perform a study of how to choose optimal timeouts to achieve best performance. Our starting point is a recently proposed SPIN-based protocol, called Shortest-Path Minded SPIN (SPMS), in which meta-data negotiations take place prior to data exchange in order to minimize the number of data transmissions. We propose a redesign of SPMS, called SPMS-Rec, which reduces the energy expended in the event of failures by requiring intermediate relay nodes to try alternate routes. Our simulation results show that SPMS-Rec outperforms SPMS, and thus SPIN, yielding energy savings while reducing the delay when multiple nodes fail along a route. We further propose a modification to SPMS-Rec through request suppression which helps in reducing redundant data transmissions.
" Topology Insensitive Location Determination Using Independent Estimates Through Semi-Directional Antennas" Chin-Lung Yang, Saurabh Bagchi, and William J. Chappell. IEEE Transactions on Antennas and Propagation, Volume: 54 , Issue: 11 , Part 2, pp. 3458 ? 3472, Nov. 2006. [ abstract ]
We demonstrate the effect of using multiple estimates from independent single wireless motes in order to decrease the dependence of location estimation on network topology in a wireless sensor network. A method of determining the location of a target using multiple compact semi-directional antennas is shown to give an independent estimate of location from each sensor mote in a network, with each estimate not relying on data from neighboring motes, as in traditional triangulation. We begin by demonstrating a method that uses angular diversity through multiple semi-directional antennas to ascertain the location of a target. The estimation of both range and angle is demonstrated in the presence of a noisy and/or faded channel. An efficient and fast algorithm for a wireless sensor mote is presented through a Taylor series expansion of the simulated antenna pattern. Furthermore, using the results of the location estimation from a single node, location determination in a realistic network is explored through both theory and simulation. The results indicate that our proposed algorithm depends significantly less on the topology (spatial arrangement) of the anchor nodes. While network planning for a variety of anchor-node topologies is shown to be necessary when using triangulation, our proposed algorithm is insensitive to the deployment of the anchor nodes. A testbed was created to experimentally demonstrate that the predictions are accurate even in triangulation-adverse topologies. The experimental testbed shows that a linear arrangement of closely spaced sensors can reduce the location error to one-fourth of the error obtained using triangulation.
" MOBIWORP : Mitigation of the Wormhole Attack in Mobile Multihop Wireless Networks", Issa Khalil, Saurabh Bagchi, and Ness Shroff. International Conference on Security and Privacy in Communication Networks, Aug 28-Sep 1, 2006, Baltimore, Maryland, USA. (Acceptancerate: 32/126 ~ 25.4%) [ Presentation ] [ abstract ]
In multihop wireless systems, the need for cooperation among nodes to relay each other's packets exposes them to a wide range of security attacks. A particularly devastating attack is the wormhole attack, where a malicious node records control traffic at one location and tunnels it to a colluding node, possibly far away, which replays it locally. This can have an adverse effect on route establishment by preventing nodes from discovering legitimate routes that are more than two hops away. Previous works on tolerating wormhole attacks have focused only on detection and used specialized hardware, such as directional antennas or extremely accurate clocks. More recent work has addressed the problem of locally isolating the malicious nodes. However, all of this work has been done in the context of static networks, due to the difficulty of secure neighbor discovery with mobile nodes. The existing work on secure neighbor discovery has limitations in accuracy, resource requirements, and applicability to ad hoc and sensor networks. In this paper, we present a countermeasure for the wormhole attack, called MOBIWORP, which alleviates these drawbacks and efficiently mitigates the wormhole attack in mobile networks. MOBIWORP uses a secure central authority (CA) for global tracking of node positions. Local monitoring is used to detect and isolate malicious nodes locally. Additionally, when sufficient suspicion builds up at the CA, it enforces a global isolation of the malicious node from the whole network. The effect of MOBIWORP on the data traffic and the fidelity of detection is brought out through extensive simulation using ns-2.
" Detection and Repair of Software Errors in Hierarchical Sensor Networks", Douglas Herbert, Yung-Hsiang Lu, Saurabh Bagchi, and Zhiyuan Li. IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC2006), June 5-7, 2006, Taichung, Taiwan. (Runner-up for best paper) (Acceptance rate: 50/210 ~ 23.8%) [ abstract ]
Sensor networks are being increasingly deployed for collecting critical data in various applications. Once deployed, a sensor network may experience faults at the individual node level or at an aggregate network level due to design errors in the protocol, implementation errors, or deployment conditions that are significantly different from the target environment. In many applications, the deployed system may fail to collect data in an accurate, complete, and timely manner because of such errors. If the network produces incorrect data, the resulting decisions based on the data may be incorrect and may negatively impact the application. Hence, it is important to detect and diagnose these faults through run-time observation. Existing technologies face difficulty with wireless sensor networks because of the large scale of the networks, the resource constraints of bandwidth and energy on the sensing nodes, and the unreliability of the observation channels for recording the behavior. This paper presents a semi-automatic approach named H-SEND (Hierarchical SEnsor Network Debugging) to observe the health of a sensor network and to remotely repair errors by reprogramming through the wireless network. In H-SEND, a programmer specifies correctness properties of the protocol ("invariants"). These invariants are associated with conditions (the "observed variables") of individual nodes or of the network. The compiler automatically inserts checking code to ensure that the observed variables satisfy the invariants. The checking can be done locally or remotely, depending on the nature of the invariant. In the latter case, messages are generated automatically. If an error is detected at run time, the logs of the observed variables are examined to analyze and correct the error. After errors are corrected, new programs or patches can be uploaded to the nodes through the wireless network. We construct a prototype to demonstrate the benefit of run-time detection and correction.
" DICAS: Detection, Diagnosis and Isolation of Control Attacks in Sensor Networks", Issa Khalil, Saurabh Bagchi, and Cristina Nina-Rotaru. IEEE Conference on Security and Privacy for Emerging Areas in Communication Networks (SecureComm), September 5 - 9, 2005, Athens, Greece. (Acceptance rate: 32/163 ~ 19.6%). [ abstract ]
Sensor networks enable a wide range of applications in both military and civilian domains. However, the deployment scenarios, the functionality requirements, and the limited capabilities of these networks expose them to a wide range of attacks against control traffic (such as wormholes, Sybil attacks, rushing attacks, etc.). In this paper we propose a lightweight protocol called DICAS that mitigates these attacks by detecting, diagnosing, and isolating the malicious nodes. DICAS uses as a fundamental building block the ability of a node to oversee its neighboring nodes' communication. On top of DICAS, we build a secure routing protocol, LSR, which also produces multiple node-disjoint paths. We analyze the security guarantees of DICAS and use ns-2 simulations to show its effectiveness against three representative attacks. Overhead analysis is conducted to prove the lightweight nature of DICAS.
" Reliable middleware for sensor networks", Mark Daniel Krasniewski. Masters Thesis, major professor: Saurabh Bagchi. Submitted: August 2005. [ Presentation ] [ abstract ]
As sensor networks operate over long periods of time deployed in inaccessible places, each sensor's requirements, location, and reliability may change, as each is subjected to unpredictable influences both external and internal to the network. As access to these sensors is usually limited, it is important that when sensor requirements change, there is a reliable, efficient, and fast means of propagating code updates over the network. This work provides a protocol for quickly disseminating data over a sensor network with high reliability and intelligent power management. It achieves energy savings by forwarding local information at the network scale, providing nodes with an estimate of their distance from code updates. This energy saving mechanism is further enhanced through directional antenna hardware, which accurately estimates node locations given a few nodes already aware of their locations. This hardware scheme shows much greater accuracy than triangulation with omni-directional antennas.
Coupled with reprogramming the network and locating sensor nodes is the challenge of locating and isolating nodes within the network that function improperly despite software-level updates. To solve this problem, this work implements a trust-based protocol that aggregates individual sensor decisions and performance over time to mask out nodes of questionable reliability. Nodes that show continued poor performance are either ignored or removed from the system; the locations of these faulty nodes are easily obtained through directional hardware localization. This scheme proves robust even in cases where half of the network is functioning unreliably.
" TIBFIT: Trust Index Based Fault Tolerance for Arbitrary Data Faults in Sensor Networks", Mark Krasniewski, Padma Varadharajan, Bryan Rabeler, Saurabh Bagchi, and Y. Charlie Hu. International Conference on Dependable Systems and Networks (DSN), June 28 - July 1, 2005, Yokohama, Japan. (Acceptance rate: PDS track 24/115 ~ 20.9%)
[ Camera ready ] [ Presentation ] [ abstract ]
Since sensor data gathering is the primary functionality of sensor networks, it is important to provide a fault-tolerant method for reasoning about sensed events in the face of arbitrary failures of the nodes sending in the event reports. In this paper, we propose a protocol called TIBFIT to diagnose and mask arbitrary node failures in an event-driven wireless sensor network. In our system model, sensor nodes are organized into clusters with rotating cluster heads. The nodes, including the cluster head, can fail in an arbitrary manner, generating missed event reports, false reports, or wrong location reports. Correct nodes are also allowed to make occasional natural errors. Each node is assigned a trust index to indicate its track record in reporting past events correctly. The cluster head analyzes the event reports using the trust index and makes event decisions. TIBFIT is analyzed and simulated using the network simulator ns-2, and its coverage is evaluated with a varying number, and varying intelligence, of malicious nodes. We show that once TIBFIT gathers enough system state, accurate event detection is possible even if more than 50% of the network nodes are compromised.
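A minimal Python sketch of trust-weighted event decisions in this spirit (illustrative only; the exponential-average update rule, the initial trust of 0.5, and the parameter values are our assumptions, not the paper's exact formulas):

class TrustIndex:
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.trust = {}              # node_id -> trust in [0, 1]

    def decide(self, reports):
        # reports: {node_id: True/False}, did the node report the event?
        yes = sum(self.trust.get(n, 0.5) for n, r in reports.items() if r)
        no = sum(self.trust.get(n, 0.5) for n, r in reports.items() if not r)
        event = yes > no             # trust-weighted vote by the cluster head
        for n, r in reports.items():
            agreed = 1.0 if r == event else 0.0
            self.trust[n] = (1 - self.alpha) * self.trust.get(n, 0.5) \
                            + self.alpha * agreed
        return event

ti = TrustIndex()
for _ in range(20):                  # node "c" persistently contradicts a, b
    ti.decide({"a": True, "b": True, "c": False})
print(round(ti.trust["c"], 2))       # ~0.06: the faulty node loses influence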
" LITEWORP: A Lightweight Countermeasure for the Wormhole Attack in Multihop Wireless Networks", Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. International Conference on Dependable Systems and Networks (DSN), June 28 - July 1, 2005, Yokohama, Japan. (Acceptance rate: PDS track 24/115 ~ 20.9%) [ Camera ready ] [ Presentation ] [ abstract ]
In multihop wireless systems, such as ad-hoc and sensor networks, the need for cooperation among nodes to relay each other's packets exposes them to a wide range of security attacks. A particularly devastating attack is known as the wormhole attack, where a malicious node records control and data traffic at one location and tunnels it to a colluding node, which replays it locally. This can have an adverse effect on route establishment by preventing nodes from discovering routes that are more than two hops away. In this paper, we present a lightweight countermeasure for the wormhole attack, called LITEWORP, which does not require specialized hardware. LITEWORP is particularly suitable for resource-constrained multihop wireless networks, such as sensor networks. We present a detailed description of LITEWORP for static networks and discuss extensions to mobile networks. Our solution allows detection of the wormhole, followed by isolation of the malicious nodes. Simulation results show that every wormhole is detected and isolated within a very short period of time over a large range of scenarios. The results also show that the fraction of packets lost due to the wormhole when LITEWORP is applied is negligible compared to the loss encountered when the method is not applied.
" Location Estimation in Ad-Hoc Networks with Directional Antenna", Nipoon Malhotra, Mark Krasniewski, Chin-Lung Yang, Saurabh Bagchi, and William Chappell. 25th International Conference on Distributed Computing Systems (ICDCS), June 6-9, 2005, Columbus, Ohio, USA. (Acceptance rate:<14% of 540) [ Presentation ] [ abstract ]
With the development of location-aware sensor applications, location determination has become an increasingly important middleware technology. Numerous current technologies determine the location of sensor nodes using the received signal strength from nodes equipped with omni-directional antennas. However, an increasing number of sensor systems are now deploying directional antennas because of advantages such as energy conservation and better bandwidth utilization. In this paper, we present techniques for location determination in a sensor network with directional antennas under different kinds of node deployments. We show how the location estimation problem can be solved by measuring the received signal strength from just one or two anchors in a 2D plane with directional antennas. We implement our technique using Berkeley MICA2 sensor motes and show that it is up to three times more accurate than triangulation using omni-directional antennas. We also perform Matlab simulations that show the accuracy of location determination with increasing node density.
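As background for the ranging step, here is a minimal Python sketch of the standard log-distance path-loss inversion (the reference power at 1 m and the path-loss exponent are illustrative constants, not calibrated values from the paper):

def rss_to_distance(rss_dbm, p0_dbm=-40.0, n=2.5):
    # Invert rss = p0 - 10*n*log10(d) to get a range estimate in meters.
    return 10 ** ((p0_dbm - rss_dbm) / (10.0 * n))

# With a directional antenna the received power also depends on the
# angular gain pattern G(theta):  rss = p0 + G(theta) - 10*n*log10(d),
# which is why one anchor can yield both a range and an angle estimate.
print(round(rss_to_distance(-65.0), 1))   # 10.0 m under these constants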
" Location Tracking with Directional Antennas in Wireless Sensor Networks", Chin-Lung Yang, Saurabh Bagchi, and William J. Chappell. IEEE MTT-S International Microwave Symposium (IMS), June 11-17, 2005, Long Beach, California, USA. [ abstract ]
In this paper, we investigate the use of multiple directional antennas on sensor motes for location determination and mobile node monitoring. One key aspect that distinguishes wireless sensor networks is inexpensive transmitters and receivers that still maintain acceptable connectivity; therefore, complex RF solutions are often not applicable. We propose and demonstrate a location estimation algorithm on a single sensor node equipped with inexpensive directional antennas, which measures the received signal strength of the transmission peers. This algorithm is further applied to the dynamic tracking of a wandering mote. The location tracking error can be reduced from 30% to 16% by using moving-average schemes and merging estimates from different sets of antennas. The mean error of the tracking estimates can also be computed to quantify the certainty of location tracking. Therefore, only a single mote with angularly diverse multiple antennas is needed to determine the location of another mote, without triangulation.
" Efficient Collection of Sensor Data in Remote Fields Using Mobile Collectors", Yuldi Tirta, Zhiyuan Li, Yung-Hsiang Lu, and Saurabh Bagchi. In Proceedings of the 13th International Conference on Computer Communications and Networks (ICCCN 2004), October 2004. (Acceptance rate: 81/200 ~ 40.5%) [ abstract ]
This paper proposes using a mobile collector, such as an airplane or a vehicle, to collect sensor data from remote fields. We present three different schedules for the collector: Round-Robin, Rate-Based, and Min Movement. The data are not immediately transmitted to the base station after being sensed but are buffered at a cluster head; hence, it is important to ensure that the latency stays within an acceptable range. We compare the latency and the energy expended by the three schedules. We use the ns-2 network simulator to study the scenarios and illustrate the conditions under which Rate-Based outperforms Round-Robin in latency, and vice versa. The benefit of Min Movement is in minimizing the energy expended.
" Robust Location Determination in Wireless Ad-hoc Networks", Nipoon Malhotra. Masters Thesis submitted in partial fulfillment of the MSECE degree. Supervisor: Prof. Saurabh Bagchi. Submitted: August 2004. [ Presentation ] [ Presentation ]
[ abstract ]
With the development of location aware applications for ad-hoc and sensor networks, location determination has become an important middleware technology. Cost, size, and energy constraints permit only a fraction of the network nodes, called anchors, to carry location determination hardware such as GPS receivers. The first part of our work involves evaluating the effect of topological characteristics, such as coverage and connectivity, on the accuracy of location determination protocols. We study a protocol called Hop-Terrain that uses the received signal strength from anchors for estimating position of sensor nodes. The resultant performance predictions can be used in motion planning for nodes to improve the accuracy of location estimates. An emerging trend in sensor networks is deploying directional RF antennas on nodes due to advantages like energy conservation and better bandwidth utilization. The location estimation techniques for omni-directional antennas cannot be employed in such systems since the received power at a node depends on additional variables like transmission and receiving angles. Our work presents techniques for location determination with directional antennas under different kinds of node deployments, such as globally aligned nodes and unaligned nodes. We show how the problem can be solved in a two-dimensional plane by using just one anchor in contrast to three anchors for omni-directional antennas. We consider the possibility of errors in individual distance measurements and present theoretical as well as simulation-based results for the level of redundancy required to have an aggregate estimation error below a desired threshold. Simulations show improvement in the accuracy of the position estimates over that of the triangulation based approach for omni-directional antennas.
" Fault Tolerant Energy Aware Data Dissemination Protocol in Sensor Network", Gunjan Khanna, Saurabh Bagchi, and Yu-Sung Wu. IEEE Dependable Systems and Networks Conference (DSN 2004), June 28-July 1, 2004, Florence, Italy. (Acceptance rate: PDS track 25/101 ~ 24.8%). [ Camera ready ] [ Presentation ] [ abstract ]
In this paper we present a data dissemination protocol for efficiently distributing data through a sensor network in the face of node and link failures. Our work is motivated by the SPIN protocol, which uses metadata negotiation to minimize data transmissions. We propose a protocol called Shortest Path Minded SPIN (SPMS), in which every node has a zone defined by its maximum transmission radius. A node that is a data source advertises the availability of data to all the nodes in its zone using a metadata descriptor. Any interested node requests the data and is sent the data using multi-hop communication via the shortest path. The failure of any node in the path is detected and recovered from using backup routes. We build simulation models to compare SPMS against SPIN. Different scenarios, including mobility and node failures, are simulated. The simulation results show that SPMS reduces delay by more than a factor of 10 and consumes 30% less energy in the static, failure-free scenario. Even with the addition of mobility, SPMS outperforms SPIN with energy gains between 5% and 21%. An analytical model is also constructed to compare the two protocols under a simplified topology.
" Controlled Mobility for Efficient Data Gathering in Sensor Networks with Passively Mobile Nodes", Yuldi Tirta, Bennett Lau, Nipoon Malhotra, Saurabh Bagchi, Zhiyuan Li, and Yung-Hsiang Lu. Sensor Network Operations, Shashi Phoha (ed.), IEEE Press, Wiley Publications, 2004. (Acceptance rate: 30/90 ~ 33.3%) [ introduction ]
Sensor networks are a particular class of wireless ad hoc networks in which the nodes have small components for sensing, actuation, and RF communication. Sensor nodes are dispersed over the area of interest, called the sensor field, are capable of short-range RF communication (about 100 ft), and contain signal processing engines to manage the communication protocols and to process data. The individual nodes have limited processing capacity but are capable of supporting distributed applications through coordinated effort in a network that can include hundreds or even thousands of nodes. Sensor nodes are typically battery-powered. Since replacing batteries is often very difficult, reducing energy consumption is an important design consideration for sensor networks. Because transmission power is proportional to the square or the fourth power of the transmission range, the range of a sensor node is constrained in most deployments. In sensor networks, the final data gathering and analysis station, called the base station, is sometimes placed far from the sensing nodes. This may be because the sensing field is in a hazardous environment, such as enemy territory, or a physically harsh environment with high temperature, etc. It may be impossible to locate a protected, high-end base station with large computational and communication capabilities in such a hazardous environment. Since the transmission range of the individual nodes is limited, the large separation between the sensing region and the base station implies that long-distance, multi-hop transmission has to occur. This architecture is difficult to deploy and maintain with regular sensing nodes acting as relay nodes.
Some nodes in the network may be mobile, either in a controlled or in an uncontrolled manner. Uncontrolled mobility, referred to as passive mobility, implies that the nodes move, but not of their own volition. Examples include nodes embedded in animals, nodes carried by human beings who move according to other considerations, and light-weight nodes carried by physical processes such as running water, glaciers, or wind. Nodes with passive mobility make data collection more challenging. Some nodes may move far away from the data aggregation point, such as a base station or a cluster head, and even become disconnected from the rest of the network. With mobility, routing data to the nodes may become inefficient, since many sensor network routing protocols rely on position knowledge. Hence, mobility has traditionally been looked on as an adversary to efficient deployment of sensor networks, due to degradation of the topology affecting parameters such as connectivity and coverage, or due to increased failure rates of wireless links. We turn this argument on its head and propose to use "controlled mobility" of certain nodes in the network to our advantage. Controlled mobility implies the ability to control the movement of an entity (direction, velocity, etc.) by sending it control commands.
" Analysis and Evaluation of Topological and Application Characteristics of Unreliable Mobile Wireless Ad-hoc Network", Serdar Cabuk, Nipoon Malhotra, Longbi Lin, Saurabh Bagchi, and Ness Shroff. 10th IEEE Pacific Rim Dependable Computing Conference (PRDC ' 04), March 2004. (Acceptance rate: 34/102 ~ 33.3%) [ Camera ready ] [ Presentation ] [ abstract ]
This paper presents a study of topological characteristics of mobile wireless ad-hoc networks. The characteristics studied are connectivity, coverage, and diameter. Knowledge of the topological characteristics of a network aids in the design and performance prediction of network protocols. This paper introduces intelligent goal-directed mobility algorithms for achieving desired topological characteristics and presents a simulation-based study. The study shows that to achieve low, medium, and high network QoS, defined in terms of combined requirements on the three metrics, the network needs 8, 16, and 40 nodes, respectively. If nodes can fail, the requirements increase to 8, 36, and 60 nodes, respectively. We present a theoretical derivation of the improvement due to the mobility models and of a sufficient condition for connectivity and coverage in the network. Next, we show the effect of improved topological characteristics in enhancing the QoS of an application-level protocol, namely a location determination protocol called Hop-Terrain. The study shows that the error in location estimation is reduced by up to 68% with goal-directed mobility.
" SECOS: Key Management for Scalable and Energy Efficient Crypto On Sensors", Issa Khalil and Saurabh Bagchi. CERIAS Tech Report TR-2003-33. [ abstract ]
Wireless sensor networks are becoming a critical computational infrastructure in which the communication between nodes needs to be protected from eavesdropping and tampering. Symmetric key cryptography is the fundamental technique being used. The protocols in this domain suffer from one or more of the problems of weak security guarantees if some nodes are compromised, lack of scalability, high energy overhead for key management, and increased end-to-end data latency. In this paper, we propose a protocol called SECOS that mitigates these problems. SECOS divides the sensor field into control groups, each with a control head. Data exchange between nodes within a control group happens through the mediation of the control head, which provides the common key. The keys and the control heads are changed periodically to enhance security. SECOS enhances the survivability of the network by handling failures of control nodes. Experiments based on a simulation model show a seven-fold reduction in energy overhead and a 50% reduction in latency compared to the state-of-the-art protocol, SPINS. We also provide an analytical derivation of the optimal control group size that operates under the resource constraints and minimizes energy consumption.
|
|
Distributed Intrusion Tolerant Systems |
-
“ A Risk Assessment Tool for Advanced Metering Infrastructures,” Tawfeeq Shawly, Jun Liu, Nathan Burow, Saurabh Bagchi, Robin Berthier (University of Illinois at Urbana-Champaign), and Rakesh B. Bobba (University of Illinois at Urbana-Champaign). Accepted to appear at the 5th IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 1-6, November 3-6, 2014. (Acceptance rate: 168/398 = 42.2%; in the Security and Privacy track, 41%)
[ Abstract ]
The growing reliance on cyber technology in the Smart Grid and other Cyber-Physical Systems (CPS) increases the challenge of providing high assurance. Advanced Metering Infrastructure (AMI) is a critical part of the smart grid, and failure to address security risks in it might disrupt utilities in their mission to deliver power reliably. The objective of this paper is to find mitigation actions for attacks that might be launched against AMI. The paper introduces a tool called SecAMI that calculates the relationship between the speed at which an attack spreads, the speed at which it can be detected, and the consequences for the availability of the AMI. The results from SecAMI can be used as a performance metric to significantly improve the development and deployment of Intrusion Detection Systems (IDSs) in an AMI environment. Experimental results with an IDS called Amilyzer show that a centralized IDS may not work as efficiently as a distributed IDS in large-scale systems in terms of detection time, especially under a Distributed Denial of Service (DDoS) attack or remote disconnects on the AMI.
[ Presentation ]
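To illustrate the spread-versus-detection trade-off that SecAMI quantifies, here is a toy Python model (the linear spread assumption, the rates, and the function name are ours, not the tool's actual model):

def meters_compromised(spread_rate, detect_time, isolate_time, total_meters):
    # Meters lost before detection plus isolation completes, capped at
    # the population size; spread_rate in meters/hour, times in hours.
    return min(total_meters, spread_rate * (detect_time + isolate_time))

# Halving detection time (e.g., with a distributed IDS) bounds the damage:
print(meters_compromised(500, 4.0, 1.0, 10000))   # 2500 meters lost
print(meters_compromised(500, 2.0, 1.0, 10000))   # 1500 meters lost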
-
“ pSigene: Webcrawling to Generalize SQL Injection Signatures,” Gaspar Modelo-Howard, Fahad A. Arshad, Christopher Gutierrez, Saurabh Bagchi, and Alan Qi. At the 44th Annual IEEE/IFIP International Symposium on Dependable Systems and Networks (DSN), pp. 45-56, June 23 - 26, 2014. (Acceptance rate: 56/185 = 30.3%)
[ Abstract ]
Intrusion detection systems (IDS) are an important component to effectively protect computer systems. Misuse detection is the most popular approach to detect intrusions, using a library of signatures to find attacks. The accuracy of the signatures is paramount for an effective IDS, yet today's practitioners rely on manual techniques to improve and update those signatures. We present a system, called pSigene, for the automatic generation of intrusion signatures by mining the vast amount of public data available on attacks. It follows a four-step process to generate the signatures, by first crawling attack samples from multiple public cybersecurity web portals. Then, a feature set is created from existing detection signatures to model the samples, which are then grouped using a biclustering algorithm that also gives the distinctive features of each cluster. Finally, the system automatically creates a set of signatures using regular expressions, one for each cluster. We tested our architecture for SQL injection attacks and found our signatures to achieve true and false positive rates of 90.52% and 0.03%, respectively, and compared our findings to other SQL injection signature sets from popular IDS and web application firewalls. Results show our system to be very competitive with existing signature sets.
[ Presentation ]
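The final step above (from a cluster of samples to a regular-expression signature) can be illustrated with a toy Python sketch; the samples and the hand-picked shared features are invented, and joining escaped features with '.*' is a simplification of what the system generates:

import re

samples = [
    "' OR 1=1 --",
    "admin' OR 1=1 #",
    "x' OR 1=1 LIMIT 1 --",
]
features = ["' OR ", "1=1"]          # substrings shared by every sample
signature = ".*".join(re.escape(f) for f in features)
rule = re.compile(signature, re.IGNORECASE)
print(all(rule.search(s) for s in samples))   # True: one regex per cluster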
-
“ Privatus: Wallet-Friendly Privacy Protection for Smart Meters,” Jinkyu Koo, Xiaojun Lin, and Saurabh Bagchi, At the 17th European Symposium on Research in Computer Security (ESORICS), pp. 1-18, September 10-12, 2012, Pisa, Italy. (Acceptance rate: 50/248 = 20.2%) [ Presentation ] [ Abstract ]
-
" v-CAPS: A Confidentiality and Anonymity Preserving Routing Protocol for Content-Based Publish-Subscribe Networks," Amiya Maji and Saurabh Bagchi, At the 7th International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 20 pages (LNCS format), London, UK, Sep 7-9, 2011. (Acceptance rate: 23/95 = 24.2%)
[ Abstract ]
Content-based Publish-Subscribe (CBPS) is a widely used communication paradigm where publishers "publish" messages and a set of subscribers receive these messages based on their interests, through filtering and routing by an intermediate set of brokers. CBPS has proven to be suitable for many-to-many communication, offering flexibility and efficiency in communications between a dynamic set of publishers and subscribers. We are interested in using CBPS in healthcare settings to disseminate health-related information (drug interactions, diagnostic information on diseases) to large numbers of subscribers in a confidentiality-preserving manner. Confidentiality in CBPS requires that the message be hidden from brokers, whereas the brokers need the message to compute routing decisions. Previous approaches to achieving these conflicting goals suffer from significant shortcomings: misrouting, reduced expressiveness of subscriber interests, high execution time, and high message overhead. Our solution, titled v-CAPS, achieves the competing goals while avoiding the previous problems. In v-CAPS, the trusted publishers extract the routing information based on the message, and the brokers keep only the minimal information needed to perform local routing. The routing information is cryptographically secured so that curious brokers or other subscribers cannot learn about the recipients. Our experiments show that v-CAPS has end-to-end message latency comparable to a baseline insecure CBPS system with unencrypted routing vectors. However, the cost of hiding the routing vectors from the brokers is significantly higher.
[ Presentation ]
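A toy Python sketch of the publisher-side routing-vector computation described above (the subscription predicates, next-hop table, and names are invented for illustration; the per-broker encryption of each vector is elided with a comment):

SUBSCRIPTIONS = {"sub1": lambda m: m["topic"] == "drug_interaction",
                 "sub2": lambda m: m["severity"] >= 8}
NEXT_HOP = {"sub1": "brokerA", "sub2": "brokerB"}    # assumed topology

def routing_vectors(message):
    # The trusted publisher matches the message against subscriptions
    # itself, so brokers never need to see the message content.
    vecs = {}
    for sub, pred in SUBSCRIPTIONS.items():
        if pred(message):
            vecs.setdefault(NEXT_HOP[sub], set()).add(sub)
    return vecs    # in v-CAPS each vector is encrypted under a broker key

print(routing_vectors({"topic": "drug_interaction", "severity": 9}))
# {'brokerA': {'sub1'}, 'brokerB': {'sub2'}}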
-
" Secure Configuration of Intrusion Detection Sensors for Changing Enterprise Systems," Gaspar Modelo-Howard, Jevin Sweval, and Saurabh Bagchi, At the 7th International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 20 pages (LNCS format), London, UK, Sep 7-9, 2011. (Acceptance rate: 23/95 = 24.2%)
[ Abstract ] [ Presentation ]
Current attacks on distributed systems involve multiple steps, because attackers usually take multiple actions to achieve their goals. Such attacks are called multi-stage attacks and have the ultimate goal of compromising a critical asset of the victim. An example would be compromising a web server, then achieving a series of intermediary steps (such as compromising a developer's box thanks to a vulnerable PHP module and connecting to an FTP server with gained credentials), to ultimately connect to a database where user credentials are stored. Current detection systems are not capable of analyzing such multi-step attack scenarios. In this document we present a distributed detection framework, based on a probabilistic reasoning engine that communicates with detection sensors, that can achieve two goals: (1) protect the critical asset by detecting multi-stage attacks, and (2) tune sensors according to the changing environment of the distributed system being monitored. As shown in the experiments, the framework reduces the number of false positives that it would otherwise report if it were only considering alerts from a single detector, and the reconfiguration of sensors allows the framework to detect attacks that take advantage of the changing system environment.
-
" Spam Detection in Voice-over-IP Calls through Semi-Supervised Clustering": Yu-Sung Wu, Saurabh Bagchi, Navjot Singh, and Ratsameetip Wita. At the 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 307-316, June 29-July 2, 2009, Lisbon, Portugal. (Acceptance rate: 63/260 = 24.2%) [ Presentation ] [ abstract ]
In this paper, we present an approach for the detection of spam calls over IP telephony, called SPIT, in VoIP systems. SPIT detection is different from spam detection in email in that the process has to be soft real-time, fewer features are available for examination due to the difficulty of mining voice traffic at runtime, and there is similarity in signaling traffic between legitimate and malicious callers. Our approach differs from existing work in its adaptability to new environments without the need for laborious and error-prone manual parameter configuration. We use clustering based on the call parameters, with optional user feedback for some calls, which users mark as SPIT or non-SPIT. We improve on a popular algorithm for semi-supervised learning, called MPCK-Means, to make it scalable to a large number of calls and able to operate at runtime. Our evaluation on captured call traces shows a fifteen-fold reduction in computation time, with improvement in detection accuracy.
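A heavily simplified Python stand-in for the semi-supervised clustering step (the features, centroid update, and two-cluster structure are illustrative assumptions; the paper builds on MPCK-Means with pairwise constraints):

def make_classifier(labeled):
    # labeled: list of (feature_vector, "spit" | "ham") from user feedback.
    cents = {}
    for label in ("spit", "ham"):
        pts = [v for v, l in labeled if l == label]
        cents[label] = [sum(xs) / len(pts) for xs in zip(*pts)]

    def classify(v, lr=0.05):
        # Assign to the nearest centroid, then nudge that centroid toward
        # the new call so the clusters adapt to the environment.
        def d2(c):
            return sum((a - b) ** 2 for a, b in zip(v, c))
        label = min(cents, key=lambda l: d2(cents[l]))
        cents[label] = [(1 - lr) * c + lr * x for c, x in zip(cents[label], v)]
        return label
    return classify

# features: (calls per hour, mean call duration in seconds)
clf = make_classifier([((120, 8), "spit"), ((3, 200), "ham")])
print(clf((90, 12)))                 # "spit"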
-
" Covert TCP/IP Timing Channels: Theory to Implementation": Sarah Sellke, Chih-Chun Wang, Saurabh Bagchi, Ness Shroff. 28th Annual IEEE Conference on Computer Communications (INFOCOM), pp. 2204-2212, April 19-25 2009, Rio de Janeiro, Brazil. (Acceptance rate: 282/1435 = 19.7%). [ Presentation ] [ abstract ]
There has been significant recent interest in covert communication using timing channels. In network timing channels, information is leaked by controlling the time between transmissions of consecutive packets. Our work focuses on network timing channels and provides two main contributions. The first is to quantify the threat posed by covert network timing channels. The other is to use timing channels to communicate at a low data rate without being detected. In this paper, we design and implement a covert TCP/IP timing channel. We are able to quantify the achievable data rate (or leak rate) of such a covert channel. Moreover, we show that by sacrificing data rate, the traffic patterns of the covert timing channel can be made computationally indistinguishable from those of normal traffic, which makes detecting such communication virtually impossible. We demonstrate the efficacy of our solution by showing significant performance gains in terms of both data rate and covertness over the state of the art.
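A minimal Python illustration of the basic mechanism, one covert bit per inter-packet gap (the gap values and jitter model are arbitrary; the traffic shaping that yields the covertness result above is far more involved):

import random

SHORT, LONG = 0.05, 0.15            # seconds between packet transmissions

def encode(bits):
    # Sender schedules a long gap for 1, a short gap for 0.
    return [LONG if b else SHORT for b in bits]

def decode(gaps, cutoff=(SHORT + LONG) / 2):
    # Receiver thresholds the observed inter-arrival times.
    return [1 if g > cutoff else 0 for g in gaps]

msg = [1, 0, 1, 1, 0]
noisy = [g + random.gauss(0, 0.005) for g in encode(msg)]   # network jitter
print(decode(noisy) == msg)          # True with high probability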
-
" Intrusion Detection in Voice-over-IP Environments": Yu-Sung Wu, Vinita Apte, Saurabh Bagchi, Sachin Garg, Navjot Singh. Elsevier International Journal of Information Security (IJIS). [ abstract ]
In this article, we present the design of an intrusion detection system for VoIP networks. The first part of our work consists of a simple single-component intrusion detection system called SCIDIVE. In the second part, we extend the design of SCIDIVE and build a distributed, correlation-based intrusion detection system called SPACEDIVE. We create several attack scenarios and evaluate the accuracy and efficiency of the system in the face of these attacks. To the best of our knowledge, this is the first comprehensive look at the problem of intrusion detection in VoIP systems. It includes treatment of the challenges faced due to the distributed nature of the system, the nature of VoIP traffic, and the specific kinds of attacks directed at such systems.
-
" Search for Efficiency in Automated Intrusion Response for Distributed Applications": Yu-Sung Wu, Gaspar Modelo-Howard, Bingrui Foo, Saurabh Bagchi, Eugene Spafford. At the 27th
International Symposium on Reliable Distributed Systems (SRDS), pp. 53-62, Naples,
Italy, October 6-8, 2008. (Acceptance rate: 28/112 = 25%) [ presentation ] [ abstract ]
Providing automated responses to security incidents in a distributed computing environment has been an important area of research. This is due to the inherent complexity of such systems, which makes it difficult to eliminate all vulnerabilities before deployment and costly to rely on humans for responding to incidents in real time. Earlier works have shed light on automated responses; they pick the best local response that stops an attack's propagation from its current step to the next step. Here we propose a new approach where the optimality of responses is considered from a global point of view: "What is the eventual outcome on the whole system from using a response?" We formalize the process of providing automated responses and the criterion for asserting global optimality of the set of deployed responses. We show that reaching the globally optimal solution is an NP-hard problem. Therefore we design a genetic algorithm framework for searching for good solutions. In the real world, good solutions can change as the problem structure changes. Here the problem structure involves the protected target system and the attacks, both of which can change over time. Our framework constantly adapts itself to the changing environment based on a short-term history and also tracks the patterns of attacks in a long-term history. We demonstrate the solution on a distributed e-commerce application called Pet Store, with injection of real attacks, and show that it improves the survivability of the system over previous works.
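A skeleton of the genetic-search idea in Python (the response names, one-point crossover scheme, and placeholder fitness function are invented; in the framework above, fitness would score the eventual system-wide outcome of deploying a response set):

import random

RESPONSES = ["block_ip", "restart_svc", "drop_session", "patch", "isolate_host"]

def fitness(bits):
    # Placeholder objective: per-response benefit minus a uniform cost.
    return sum(b * w for b, w in zip(bits, [3, 1, 2, 5, 2])) - 2 * sum(bits)

def evolve(pop_size=20, gens=50, mut=0.1):
    pop = [[random.randint(0, 1) for _ in RESPONSES] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(RESPONSES))   # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if random.random() < mut else g
                             for g in child])           # mutation
        pop = survivors + children
    best = max(pop, key=fitness)
    return [r for r, g in zip(RESPONSES, best) if g]

print(evolve())   # e.g. ['block_ip', 'patch'] under this toy objective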
-
" Determining Placement of Intrusion Detectors for a Distributed Application through Bayesian Network Modeling": Gaspar Modelo-Howard, Saurabh Bagchi, Guy Lebanon. At
the 11th International Symposium on Recent Advances in Intrusion
Detection (RAID), pp. 271-290, Boston, MA, September 15-17, 2008. (Acceptance rate:
20/80 = 25%)
[ presentation ]
[ abstract ]
To secure today’s computer systems, it is critical to have different intrusion detection sensors embedded in them. The complexity of distributed computer systems makes it difficult to determine the appropriate configuration of these detectors, i.e., their choice and placement. In this paper, we describe a method to evaluate the effect of the detector configuration on the accuracy and precision of determining security goals in the system. For this, we develop a Bayesian network model for the distributed system, from an attack graph representation of multi-stage attacks in the system. We use Bayesian inference to solve the problem of determining the likelihood that an attack goal has been achieved, given a certain set of detector alerts. We quantify the overall detection performance in the system for different detector settings, namely, choice and placement of the detectors, their quality, and levels of uncertainty of adversarial behavior. These observations lead us to a greedy algorithm for determining the optimal detector settings in a large-scale distributed system. We present the results of experiments on Bayesian networks representing two real distributed systems and real attacks on them.
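A stripped-down version of the inference step can be written directly; the two-stage attack graph and all probabilities below are invented for illustration (the paper derives them from the system's attack graph and the detectors' characteristics):

    # A -> G: attack step A enables goal G; a detector D watches step A.
    P_A = 0.1                                   # prior on step A occurring
    P_G_GIVEN_A = {True: 0.8, False: 0.01}      # goal reached given A
    P_D_GIVEN_A = {True: 0.9, False: 0.05}      # detector fires given A

    def posterior_goal_given_alert():
        """P(G | D fired), enumerating the hidden attack step A."""
        num = den = 0.0
        for a in (True, False):
            p_a = P_A if a else 1.0 - P_A
            joint = p_a * P_D_GIVEN_A[a]
            den += joint
            num += joint * P_G_GIVEN_A[a]
        return num / den

    print(f"P(goal | alert) = {posterior_goal_given_alert():.3f}")

Posteriors of this kind, computed for each candidate choice and placement of detectors, are the quantities the paper's greedy placement algorithm optimizes over.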
-
" Intrusion Response Systems: A Survey": Bingrui Foo, Matthew W. Glause, Gaspar Modelo-Howard, Yu-Sung Wu, Saurabh Bagchi, and Eugene Spafford. Book chapter in "Information Assurance: Dependability and Security in Networked Systems", , pp. 377-416, Morgan Kaufmann Publishers. Publication date: Fall 2007. [ abstract ]
Protecting networks from attacks is an important concern of computer security. Within this, intrusion prevention and intrusion detection systems have been the subject of much study and have been covered in several excellent survey papers. However, the actions that need to follow the steps of prevention and detection, namely response, have received less attention from researchers and practitioners. Response was traditionally thought of as an offline process, with humans in the loop, such as system administrators performing forensics by going through the system logs and determining which services or components need to be recovered. Our systems today have reached such a level of complexity, and the attacks directed at them such a level of sophistication, that manual responses are no longer adequate. So far there has been limited work on autonomous intrusion response systems, especially work that provides rigorous analysis or generalizable system-building techniques, and the work that exists has not been surveyed previously. In this survey paper, we lay out the design challenges in building autonomous intrusion response systems. Then we provide a classification of existing work on the topic into four categories – response through static decision tables, response through dynamic decision process, intrusion tolerance through diverse replicas, and intrusion response for specific classes of attacks. We describe the existing intrusion response systems after classifying them. We present methods for benchmarking the intrusion response systems. We end with ideas for further work in the field.
-
" Capacity Bounds on Timing Channels with Bounded Service Times": Sarah H. Sellke, Chih-Chun Wang, Ness Shroff, and Saurabh Bagchi. In the IEEE International Symposium on Information Theory, pp. 981-985, Nice, France, June 24-29, 2007.[ abstract ]
It is well known that queues with exponentially distributed service times have the smallest Shannon capacity among all single-server queues with the same service rate. In this paper, we study the capacity of timing channels in which the service time distributions have bounded support, i.e., Bounded Service Timing Channels (BSTC). We derive an upper bound and two lower bounds on the capacity of such timing channels.
The tightness of these bounds is investigated analytically as well as via simulations. We find that the uniform BSTC serves a role for BSTCs that is similar to what the exponential service timing channel does for the case of timing channels with unbounded service time distributions. That is, when the length of the support interval is small, the uniform BSTC has the smallest capacity among all BSTCs.
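For orientation, the classical benchmark that this paper generalizes is the exponential-server timing channel, whose capacity (the well-known result of Anantharam and Verdú, "Bits through queues") is, in LaTeX notation,

    C_{\exp} = \frac{\mu}{e} \quad \text{nats per second},

for a single-server queue of service rate \mu. The upper and lower bounds derived here play the analogous benchmarking role when service times are confined to a bounded interval.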
-
" Automated Adaptive Intrusion Containment in Systems of Interacting Services", Yu-Sung Wu, Bingrui Foo, Yu-ChunMao, Saurabh Bagchi, Eugene Spafford. Elsevier Journal of Computer Networks, Volume 51, Issue 5, pp. 1334-1360, April 2007.
[ abstract ]
Large scale distributed systems typically have interactions among different services that create an avenue for propagation of a failure from one service to another. The failures being considered may be the result of natural failures or malicious activity, collectively called disruptions. To make these systems tolerant to failures, it is necessary to automatically contain the spread of a disruption once it is detected. The objective is to allow parts of the system to continue to provide partial functionality in the face of failures. Real world situations impose several constraints on the design of such a disruption tolerant system, of which we consider the following – the alarms may have type I or type II errors; it may not be possible to change a service itself even though its interactions may be changed; attacks may use steps that are not anticipated a priori; and there may be bursts of concurrent alarms. We present the design and implementation of a system named ADEPTS as the realization of such a disruption tolerant system. ADEPTS uses a directed graph representation to model the spread of the failure through the system, presents algorithms for determining appropriate responses and monitoring their effectiveness, and quantifies the effect of disruptions through a high level survivability metric. ADEPTS is demonstrated on a real e-commerce testbed with actual attack patterns injected into it.
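A minimal sketch of the containment idea follows; the service graph and the policy of flagging interaction edges reachable from a compromised service are illustrative assumptions, not ADEPTS's actual algorithms:

    from collections import deque

    # Hypothetical service-interaction graph: service -> downstream services.
    SERVICES = {"web": ["app"], "app": ["db", "dir"], "db": [], "dir": []}

    def containment_edges(graph, compromised):
        """BFS from the compromised service; collect interaction edges
        along which the disruption could spread, as response candidates."""
        edges, seen, queue = [], {compromised}, deque([compromised])
        while queue:
            svc = queue.popleft()
            for nxt in graph[svc]:
                edges.append((svc, nxt))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return edges

    print(containment_edges(SERVICES, "web"))
    # [('web', 'app'), ('app', 'db'), ('app', 'dir')]

ADEPTS goes well beyond this sketch: it weighs feedback on response effectiveness and the survivability metric when choosing which interactions to restrict.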
-
" Timing Channel Capacity for Uniform and Gaussian Servers", Sarah Sellke, Ness B. Shroff, Saurabh Bagchi, and Chih-Chun Wang. Forty-Fourth Annual Allerton Conference On Communication, Control, and Computing, 4 pages, Sep 27-29, 2006, Allerton, IL, USA. [ Presentation ] [ abstract ]
It is well known that queues with exponentially distributed service times have the smallest Shannon capacity among all single-server queues with the same service rate. However, the capacity of other service disciplines, such as Gaussian service timing channels (GSTC), remains unknown. In this paper, we outline a preliminary investigation into timing channel capacity when the service distributions are uniform, Gaussian, and truncated Gaussian. Our study indicates that the capacity per service time increases as the ratio of the standard deviation to the mean service time of the server decreases.
-
" SPACEDIVE: A Distributed Intrusion Detection System for Voice-over-IP Environments", Vinita Apte, Yu-Sung Wu, Saurabh Bagchi, Sachin Garg, and Navjot Singh. Fast Abstract in the Supplemental Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 222-223, June 25-28, 2006, Philadelphia, Pennsylvania, USA. [ Presentation ] [ Presentation ] [ abstract ]
Voice over IP (VoIP) systems are gaining in popularity as the technology for transmitting voice traffic over IP networks. Along with the anticipated widespread adoption of VoIP systems comes the possibility of security attacks targeted against such systems. The attacks can be thought of as a combination of traditional kinds of security attacks against IP networks and novel attacks enabled by the architecture of VoIP systems. First, VoIP applications have soft real-time requirements. Second, the attacks can span multiple protocols between different end points and may be spread over arbitrary time periods. Considering a range of attack scenarios seen in practice, we observe that the attack symptom is often detectable only by correlating information from multiple sources. The correlation is required among information from multiple protocols at multiple end points. The correlation may need to be done across sources that are peers, such as two communicating clients, or across peer levels, such as the communicating clients and the servers. We propose the design of a system called SPACEDIVE to serve as a correlation-based IDS for VoIP systems. The Snort IDS [2] is well known for its efficiency in examining incoming packets, and SPACEDIVE leverages this functionality. To achieve good performance, SPACEDIVE is built into Snort by reusing part of its low-level functionality (examining and processing packets), extending it (e.g., to build state to support stateful detection), and building from scratch the high-level functionality specific to the VoIP environment.
-
" Resource Failure Prediction in Fine-Grained Cycle Sharing Systems", Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. IEEE International Symposium on High Performance Distributed Computing, June 19-23, 2006, Paris, France. (Runner-up for best paper) [ Presentation ] [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of computational resources available on the Internet. In FGCS, host computers allow guest jobs to utilize the CPU cycles if the jobs do not significantly impact the local users of a host. A characteristic of such resources is that they are generally provided voluntarily and their availability fluctuates highly. Guest jobs may incur resource failures because of unexpected resource unavailability.
To provide fault tolerance to guest jobs without adding significant computational overhead, failure prediction is required. This paper presents a method to predict resource failures in FGCS systems. It applies a semi-Markov Process and is based on a novel failure model that combines generic hardware-software failures with domain-specific failures in FGCS. We describe the failure prediction framework and its implementation in a production FGCS system named iShare. Through experiments on an iShare testbed, we demonstrate that the prediction achieves accuracy above 86% on average and outperforms linear time series models, while the computational cost is negligible. Our experimental results also show that the prediction is robust in the presence of irregular resource failures.
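The flavor of the prediction can be seen in a toy discrete-time reduction; the states and transition probabilities below are invented, whereas the paper fits a semi-Markov process, including state holding-time distributions, to observed host behavior:

    import numpy as np

    STATES = ["free", "user_active", "unavailable"]   # hypothetical states
    P = np.array([
        [0.90, 0.08, 0.02],   # from free
        [0.30, 0.65, 0.05],   # from user_active
        [0.00, 0.00, 1.00],   # treat unavailable as absorbing in the window
    ])

    def p_failure_within(steps, start="free"):
        """P(host becomes unavailable within the prediction window)."""
        dist = np.zeros(len(STATES))
        dist[STATES.index(start)] = 1.0
        dist = dist @ np.linalg.matrix_power(P, steps)
        return dist[STATES.index("unavailable")]

    print(f"{p_failure_within(12):.3f}")

Because "unavailable" is made absorbing, the probability mass there at the end of the window equals the probability of having failed at some point within it.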
-
" ADEPTS: Adaptive Intrusion Response using Attack Graphs in an E-Commerce Environment", Bingrui Foo , Yu-Sung Wu, Yu-Chun Mao, Saurabh Bagchi, and Eugene Spafford. International Conference on Dependable Systems and Networks (DSN), June 28- July 1, 2005, Yokohama, Japan. (Acceptance rate: DCCS track 54/204 ~ 26.8%) [ Camera ready ] [ Presentation ]
[ abstract ]
Distributed systems with multiple interacting services, especially e-commerce systems, are suitable targets for malicious attacks because of the potential financial impact. Compared to intrusion detection, automated response has received relatively less attention. In this paper, we present the design of automated response mechanisms in an intrusion tolerant system called ADEPTS. Our focus is on enforcing containment in the system, thus localizing the intrusion and allowing the system to provide service, albeit degraded. ADEPTS uses a graph of intrusion goals, called IGRAPH, as the underlying representation in the system. In response to alerts from an intrusion detection framework, ADEPTS executes algorithms to determine the spread of the intrusion and the appropriate responses to deploy. A feedback mechanism evaluates the success of a deployed response and uses that in guiding future choices. ADEPTS is demonstrated on a distributed e-commerce system and evaluated using a survivability metric.
-
" Modeling and Automated Containment of Worms", Sarah Sellke, Ness B. Shroff, and Saurabh Bagchi. International Conference on Dependable Systems and Networks (DSN), June 28 - July 1, 2005, Yokohama, Japan. (Acceptance rate: DCCS track 54/204 ~ 26.8%) [ Camera ready ] [ Presentation ] [ abstract ]
Self-propagating codes, called worms, such as Code Red, Nimda, and Slammer, have drawn significant attention due to their enormous adverse impact on the Internet. There is great interest in the research community in modeling the spread of worms and in providing adequate defense mechanisms against them. In this paper, we present a (stochastic) branching process model for characterizing the propagation of Internet worms. This model leads to the development of an automatic worm containment strategy that prevents the spread of worms beyond their early stages. Specifically, using the branching process model, we are able to (1) provide a precise condition that determines whether the worm will eventually die out and (2) provide the probability that the total number of hosts that the worm infects will be below a certain level. We use these insights to develop a simple automatic worm containment scheme, which is demonstrated, through simulations and real trace data, to be both effective and non-intrusive.
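The extinction dichotomy at the heart of such a model is easy to check empirically: a branching process dies out almost surely when the mean number of "offspring" infections per infected host is at most one. A quick Monte Carlo sketch, with all parameters invented:

    import math
    import random

    def poisson(lam):
        """Knuth's method; one Poisson draw without external libraries."""
        l, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= random.random()
            if p <= l:
                return k
            k += 1

    def outbreak_size(mean_offspring, cap=10_000):
        """Total infections of a Poisson branching process, capped."""
        total = active = 1
        while active and total < cap:
            births = sum(poisson(mean_offspring) for _ in range(active))
            total += births
            active = births
        return total

    for m in (0.8, 1.2):
        runs = [outbreak_size(m) for _ in range(200)]
        blowups = sum(r >= 10_000 for r in runs)
        print(f"mean offspring {m}: {blowups}/200 runs exploded")

Intuitively, a containment scheme of the kind the paper proposes succeeds when it drives the effective mean offspring count below one while the worm is still in its early stage.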
-
" SCIDIVE: A Stateful and Cross Protocol Intrusion Detection Architecture for Voice-over-IP Environments", Saurabh Bagchi, Yu-Sung Wu (Purdue U. , USA); Sachin Garg, Navjot Singh, Tim Tsai (Avaya Labs, USA). IEEE Dependable Systems and Networks Conference (DSN 2004), June28-July 1, 2004, Florence, Italy. (Acceptance rate: DCCS track 58/276 ~ 21%).
[ Camera ready ] [ Presentation ] [ abstract ]
Voice over IP (VoIP) systems are gaining in popularity as the technology for transmitting voice traffic over IP networks. As the popularity of VoIP systems increases, they are being subjected to different kinds of intrusions, some of which are specific to such systems and some of which follow a general pattern. VoIP systems pose several new challenges to Intrusion Detection System (IDS) designers. First, these systems employ multiple protocols for call management (e.g., SIP) and data delivery (e.g., RTP). Second, the systems are distributed in nature and employ distributed clients, servers and proxies. Third, the attacks to such systems span a large class, from denial of service to billing fraud attacks. Finally, the systems are heterogeneous and typically under several different administrative domains. In this paper, we propose the design of an intrusion detection system targeted to VoIP systems, called SCIDIVE (pronounced “Skydive”). SCIDIVE is structured to detect different classes of intrusions, including masquerading, denial of service, and media stream-based attacks. It can operate with both classes of protocols that compose VoIP systems – call management protocols (CMP), e.g., SIP, and media delivery protocols (MDP), e.g., RTP. SCIDIVE proposes two abstractions for VoIP IDS – stateful detection and cross-protocol detection. Stateful detection denotes assembling state from multiple packets and using the aggregated state in the rule-matching engine. Cross-protocol detection denotes matching rules that span multiple protocols. SCIDIVE is demonstrated on a sample VoIP system that comprises SIP clients and SIP proxy servers with RTP as the data delivery protocol. Four attack scenarios are created, and the accuracy and efficiency of the system are evaluated with rules meant to catch these attacks.
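The two abstractions can be illustrated with one rule that is both stateful (it remembers SIP dialog state) and cross-protocol (it correlates SIP with RTP); the event interface and the specific rule below are assumptions for the sketch, not SCIDIVE's rule language:

    call_state = {}   # call_id -> "active" | "terminated"

    def on_sip(call_id, method):
        """Track dialog state from call-management (SIP) events."""
        if method == "INVITE":
            call_state[call_id] = "active"
        elif method == "BYE":
            call_state[call_id] = "terminated"

    def on_rtp(call_id):
        """Cross-protocol check: media is only legitimate for an active dialog."""
        if call_state.get(call_id) != "active":
            print(f"ALERT: RTP outside an active SIP dialog (call {call_id})")

    on_sip("c1", "INVITE"); on_rtp("c1")   # legitimate media
    on_sip("c1", "BYE");    on_rtp("c1")   # flagged: media after teardown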
-
" ADEPTS: Adaptive Intrusion Containment in Distributed Service Environments", Bingrui Foo, Yu-Sung Wu, Saurabh Bagchi, Gene Spafford, and Blake Matheny. CERIAS Tech Report 2004. [ abstract ]
Distributed systems with multiple interacting services, such as distributed e-commerce systems, are suitable targets for malicious attacks because of the potential financial impact. Intrusion detection in such systems has been an active area of research, while the problem of containment has received relatively less attention. Containment seeks to localize the effect of the intrusion to some parts of the system while allowing the other parts to continue to provide service. In this paper, we present the design and implementation of an Adaptive Intrusion Tolerant System, ADEPTS, for automatically containing intrusions in a distributed system. ADEPTS uses a directed acyclic graph of intrusion goals, called I-DAG, and a graph of service interactions, called SNet, as the underlying representations in the system. The containment action in ADEPTS initially has the goal of preventing the spread of the intrusion by modifying its path of escalation in the I-DAG. Failing that, it adopts a more drastic response of modifying the interactions of the services in the SNet. There is also a feedback mechanism that evaluates the effectiveness of a deployed response and uses that in guiding future choices. ADEPTS is demonstrated on a distributed e-commerce system and evaluated using a survivability metric whose value depends on the operational services in the face of an intrusion.
-
" ADEPTS: Adaptive Intrusion Containment and Response using Attack Graphs in an E-Commerce Environment", Yu-Sung Wu, Bingrui Foo, Blake Matheny, Tyler Olsen, and Saurabh Bagchi. CERIAS Tech Report 2003-33. [ Presentation ] [ abstract ]
Distributed e-commerce systems are suitable targets for malicious attacks because of the potential financial impact. Intrusion detection in such systems has been an active area of research. Once an intrusion is detected, it is important to contain the effect of the intrusion to some parts of the system while allowing the other parts to continue to provide service. It is also important to take preventive or reactive responses to reduce the likelihood of the system being compromised through a future attack. In this paper, we present the design and implementation of an Adaptive Intrusion Tolerant System, ADEPTS, for automatically containing and responding to intrusions in a distributed e-commerce system. We use a directed acyclic graph (DAG) of intrusion goals, called I-DAG, as the underlying representation in the system. In the I-DAG, the nodes are sub-goals of an attack, and to reach a particular node, the goals corresponding to its child nodes have to be achieved first. We assume an intrusion detection framework which provides alerts to ADEPTS. In response, a parallel algorithm is executed to compute the likelihood that one or more goals in the DAG have been achieved. Next, a response measure computation algorithm is executed to determine the appropriate response action. There is also a feedback mechanism which estimates the success or failure of a deployed response and uses that in adjusting the system weights to guide future choices. ADEPTS is implemented on a distributed e-commerce system that comprises services including a web server, application server, database server, and directory server. Alerts are simulated corresponding to different attack types, the algorithms are executed, and response actions are deployed. The experiments bring out the latency of the infrastructure and its effectiveness in dealing with failed responses through escalation, compared to statically mapped Intrusion Response Systems (IRS).
-
" Collaborative Intrusion Detection System (CIDS): A Framework for Accurate and Efficient IDS", Yu-Sung Wu, Bingrui Foo, Yongguo Mei, and Saurabh Bagchi. 19th Annual Computer Security Applications Conference, December 8-12, 2003, Las Vegas , Nevada . (Acceptance rate: 36/110 ~ 32.7% ). [ Presentation ] [ abstract ]
In this paper, we present the design and implementation of a Collaborative Intrusion Detection System (CIDS) for accurate and efficient intrusion detection in a distributed system. CIDS employs multiple specialized detectors at the different layers – network, kernel and application – and a manager-based framework for aggregating the alarms from the different detectors to provide a combined alarm for an intrusion. The premise is that a carefully designed and configured CIDS can increase the accuracy of detection compared to individual detectors, without a substantial degradation in performance. In order to validate the premise, we present the design and implementation of a CIDS which employs Snort, Libsafe, and a new kernel-level IDS called Sysmon. The manager has a graph-based and a Bayesian-network-based aggregation method for combining the alarms to finally come up with a decision about the intrusion. The system is evaluated using a web-based electronic storefront application under three different classes of attacks – buffer overflow, flooding and script-based attacks. The results show performance degradations, compared to no detection, of 3.9% and 6.3% under normal workload and a buffer overflow attack respectively. The experiments to evaluate the accuracy of the system show that the normal workload generates false alarms for Snort and the elementary detectors produce missed alarms. CIDS does not flag the false alarm and reduces the incidence of missed alarms to 1 of the 7 cases. CIDS can also be used to measure the propagation time of an intrusion, which is useful in choosing an appropriate response strategy.
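A toy version of the manager's job follows; the detector names match those in the abstract, but the weights, threshold, and voting rule are invented (CIDS itself uses the graph-based and Bayesian-network-based aggregation described above):

    WEIGHTS = {"snort": 0.40, "sysmon": 0.35, "libsafe": 0.25}
    THRESHOLD = 0.5

    def combined_verdict(alarms):
        """alarms: detector name -> fired? Weighted vote across layers."""
        score = sum(w for d, w in WEIGHTS.items() if alarms.get(d))
        return score >= THRESHOLD

    # One detector alone stays below threshold; two corroborating ones alert.
    print(combined_verdict({"snort": True}))                    # False
    print(combined_verdict({"snort": True, "sysmon": True}))    # True

Corroboration across layers is what lets such a manager suppress a single detector's false alarms while recovering the elementary detectors' misses.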
|
|
PhD Theses |
- Bowen Zhou, "Techniques for Detecting Scalability Bugs". Defended: July 18, 2014:
[ Abstract ] [ Thesis ] [ Presentation ]
- Fahad Ali Arshad, "Failure Characterization and Error Detection in Distributed Applications". Defended: April 23, 2014:
[ Abstract ] [ Thesis ] [ Presentation ]
- Tanzima Zerin, "Reliable and Scalable Checkpointing Systems for Distributed Computing Environments". Defended: April 8, 2013:
[ Abstract ] [ Thesis ] [ Presentation ]
- Gaspar Modelo-Howard, "Secure Configuration of Intrusion Detection Sensors for Changing Enterprise Systems". Defended: December 11, 2012:
[ Abstract ] [ Thesis ] [ Presentation ]
- Ignacio Laguna, "Probabilistic Fault Detection and Diagnosis in Large-Scale Distributed Applications". Defended: November 27, 2012:
[ Abstract ] [ Thesis ] [ Presentation ]
- Donghoon Shin, "Algorithms for Distributed Monitoring in Multi-Channel Ad Hoc Wireless Networks". Defended: July 19, 2012:
[ Abstract ] [ Thesis ] [ Presentation ]
- Jinkyu Koo, "Secure Control Protocols for Resource-Constrained Embedded Systems". Defended: June 14, 2012:
[ Abstract ] [ Thesis ] [ Presentation ]
- Rajesh Panta, "Remote Reprogramming of Wireless Sensor Networks". Defended: May 7, 2010:
[ Abstract ] [ Thesis ] [ Presentation ]
- Sarah Sellke, "Analytical Characterization of Internet Security Attacks". Defended: Jan 7, 2010:
[ Abstract ] [ Thesis ] [ Presentation ]
- Yu-Sung Wu, "Achieving High Survivability in Distributed Systems through Automated Response". Defended: April 13, 2009.
[ Abstract ] [ Thesis ] [ Presentation ]
- Gunjan Khanna, "Non Intrusive Detection and Diagnosis of Failures in High Throughput Distributed Systems". Defended: June 11, 2007. [ Abstract ] [ Thesis ] [ Presentation ]
- Issa Khalil, "Mitigation of Control and Data Traffic Attacks in Wireless Ad-Hoc and Sensor Networks". Defended: December 14, 2006. [ Abstract ] [ Thesis ]
|
|
Miscellaneous |
- "Effects of Types of Active Learning Activity on Two Junior-Level Computer Engineering Courses": Saurabh Bagchi, Mark C. Johnson, and Somali Chaterji. Accepted to appear as a full paper, 6 pages, 38th Annual Frontiers in Education (FIE) Conference, Saratoga Springs, New York, October 25-28, 2008.
- "Impact of Research Technologies on Service Learning": Saurabh Bagchi, Carla B. Zoltowski, and William C. Oakes. Work In Progress, 2 pages, 38th Annual Frontiers in Education (FIE) Conference, Saratoga Springs, New York, October 25-28, 2008.
|
|
Copyright notice: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the appropriate publisher (IEEE, ACM, Elsevier, etc.) |