2015
-
" Optimizing Defensive Investments in Energy-Based Cyber-Physical Systems,
Paul Wood, Saurabh Bagchi, and Alefiya Hussain. Accepted to appear at the Dependable Parallel, Distributed and Network-Centric Systems (DPDNS) Workshop, to be held with 29th IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp. 1-10, May 25-29, 2015, Hyderabad, India.
[Presentation ] [ Abstract ]
Interdependent cyber-physical systems (CPS) connect
physical resources between independently motivated actors
who seek to maximize profits while providing physical services
to consumers. Cyberattacks in seemingly distant parts of
these systems have local consequences, and techniques are
needed to analyze and optimize defensive costs in the face of
increasing cyber threats. This paper presents a technique for
transforming physical interconnections between independent
actors into a dependency analysis that can be applied to find
optimal defensive investment strategies to protect assets from
financially motivated adversaries in electric power grids.
-
" TARDIS: Software-Only System-Level Record and Replay in Wireless Sensor Networks,
"Matthew Tancreti, Vinaitheerthan Sundaram, Saurabh Bagchi, and Patrick Eugster. Accepted to appear at the 14th ACM/IEEE Conference on Information
Processing in Sensor Networks (IPSN), pp. 1-12, April 13-17, 2015, Seattle, WA. (Acceptance rate: 27/111 = 24.3%)
[Presentation ] [ Abstract ]
Wireless sensor networks (WSNs) are plagued by the possibility
of bugs manifesting only at deployment. However,
debugging deployed WSNs is challenging for several reasons—
the remote location of deployed sensor nodes, the non-determinism
of execution that can make it difficult to replicate a
buggy run, and the limited hardware resources available on
a node. In particular, existing solutions to record and replay
debugging in WSNs fail to capture the complete code execution,
thus negating the possibility of a faithful replay and
causing a large class of bugs to go unnoticed. In short, record
and replay logs a trace of predefined events while a deployed
application is executing, enabling replaying of events later
using debugging tools. Existing recording methods fail due
to the many sources of non-determinism and the scarcity of
resources on nodes.
In this paper we introduce Trace And Replay Debugging
In Sensornets (Tardis), a software-only approach for deterministic
record and replay of WSN nodes. Tardis is able to
record all sources of non-determinism, based on the observation
that such information is compressible using a combination
of techniques specialized for respective sources. Despite
their domain-specific nature, the techniques presented are
applicable to the broader class of resource-constrained embedded
systems. We empirically demonstrate the viability
of our approach and its effectiveness in diagnosing a newly
discovered bug in a widely used routing protocol.
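The central trick, compressing each source of non-determinism with an encoder matched to its statistics, can be illustrated in a few lines. The sketch below is a toy stand-in, not the Tardis implementation; the two sources (a monotonic clock and a repetitive ADC register) and the delta/run-length pairing are assumptions for illustration.
```python
# Toy record-and-replay: log non-deterministic reads, compress each source
# with a technique suited to how its values behave.
import itertools

def record(clock_reads, adc_reads):
    """Compress two non-deterministic sources with source-specific encoders."""
    # Clock values grow monotonically: store small deltas instead of raw values.
    clock_log = [clock_reads[0]] + [b - a for a, b in zip(clock_reads, clock_reads[1:])]
    # ADC readings tend to repeat: run-length encode them.
    adc_log = [(v, len(list(g))) for v, g in itertools.groupby(adc_reads)]
    return clock_log, adc_log

def replay(clock_log, adc_log):
    """Reconstruct the exact sequences so a re-execution sees identical inputs."""
    clock_reads = list(itertools.accumulate(clock_log))
    adc_reads = [v for v, n in adc_log for _ in range(n)]
    return clock_reads, adc_reads

clocks = [100, 103, 107, 112]
adcs = [512, 512, 512, 519]
assert replay(*record(clocks, adcs)) == (clocks, adcs)
```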
2014
-
" Reliable and Efficient Distributed Checkpointing System for Grid Environments," Tanzima Zerin Islam, Saurabh Bagchi, and Rudolf Eigenmann. Journal of Grid Computing, December 2014, Volume 12, Issue 4, pp 593-613
[ Abstract ]
In Fine-Grained Cycle Sharing (FGCS)
systems, machine owners voluntarily share their
unused CPU cycles with guest jobs, as long as
their performance degradation is tolerable. However,
unpredictable evictions of guest jobs lead to fluctuating
completion times. Checkpoint-recovery is an
attractive mechanism for recovering from such “failures”.
Today’s FGCS systems often use expensive,
high-performance dedicated checkpoint servers. However,
in geographically distributed clusters, this may
incur high checkpoint transfer latencies. In this paper
we present a distributed checkpointing system called
FALCON that uses available disk resources of the
FGCS machines as shared checkpoint repositories.
However, an unavailable storage host may lead to loss
of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm
for choosing reliable checkpoint repositories.
We experiment with FALCON in the university-wide
Condor testbed at Purdue and show improved and consistent
performance for guest jobs in the presence of
irregular resource availability.
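As an illustration of the repository-selection step, the sketch below ranks storage hosts by a predicted probability of staying available; the exponential failure model and the eviction-rate field are assumptions, not FALCON's actual predictor.
```python
# A minimal sketch of choosing reliable checkpoint repositories.
import math

def pick_repositories(hosts, window_hours, k=2):
    """Rank storage hosts by probability of staying up for `window_hours`."""
    def survival(h):
        # Assume exponentially distributed time-to-eviction per host.
        return math.exp(-h["evictions_per_hour"] * window_hours)
    ranked = sorted(hosts, key=survival, reverse=True)
    return ranked[:k]  # replicate the checkpoint on the k safest hosts

hosts = [
    {"name": "lab-a", "evictions_per_hour": 0.02},
    {"name": "lab-b", "evictions_per_hour": 0.30},
    {"name": "lab-c", "evictions_per_hour": 0.05},
]
print([h["name"] for h in pick_repositories(hosts, window_hours=4)])
```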
-
" Addressing failures in exascale computing," Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak et al. International Journal of High Performance Computing Applications (2014): doi:10.1177/1094342014522573, pp. 1-45
[ Abstract ]
We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City,
Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all
the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software
layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software
perspective and focusing on a combined approach.
The workshop brought together participants with expertise in applications, system software, and hardware; they came
from industry, government, and academia, and their interests ranged from theory to implementation. The combination
allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
-
" Mitigating Interference in Cloud Services by Middleware Reconfiguration,"
Amiya Maji, Subrata Mitra, Bowen Zhou, Saurabh Bagchi and Akshat Verma (IBM Research). Accepted to appear at the 15th ACM/IFIP/USENIX Middleware Conference, pp. 1-12, Nov 16-21, 2014. (Acceptance rate: 27/144 = 18.8%)
[ Presentation ] [ Abstract ]
Application performance has been and remains one of the top five concerns since the inception of cloud computing. A primary determinant of application performance is multi-tenancy, or sharing of hardware resources, in clouds. While some hardware resources can be partitioned well among VMs (such as CPUs), many others cannot (such as memory bandwidth). In this paper, we focus on understanding the variability in application performance on a cloud and explore ways for an end customer to deal with it. Based on rigorous experiments using CloudSuite, a popular Web 2.0 benchmark, running on EC2, we found that interference-induced performance degradation is a reality. On a private cloud testbed, we also observed that interference impacts the choice of best configuration values for applications and middleware. We posit that intelligent reconfiguration of application parameters presents a way for an end customer to reduce the impact of interference. However, tuning the application to deal with interference is challenging for two fundamental reasons --- the configuration depends on the nature and degree of interference, and there are inter-parameter dependencies. We design and implement the IC2 system to address the challenges of detection and mitigation of performance interference in clouds. Compared to an interference-agnostic configuration, the proposed solution provides up to 29% and 40% improvement in average response time on EC2 and a private cloud testbed, respectively.
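A minimal sketch of the detect-then-reconfigure loop follows; the worker-count and keepalive parameters and the 1.5x latency threshold are invented for illustration and are not IC2's actual tuning rules.
```python
# Toy interference mitigation: detect latency deviation, then switch to a
# configuration that sheds contention-sensitive load.
def choose_config(avg_response_ms, baseline_ms, configs):
    """Switch to an interference-tolerant configuration when latency deviates."""
    interference = avg_response_ms > 1.5 * baseline_ms  # simple detector
    return configs["interference"] if interference else configs["normal"]

configs = {
    "normal":       {"workers": 64, "keepalive": True},
    "interference": {"workers": 24, "keepalive": False},  # shed contention
}
print(choose_config(avg_response_ms=410, baseline_ms=200, configs=configs))
```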
-
" A Risk Assessment Tool for Advanced Metering Infrastructures,"
Tawfeeq Shawly, Jun Liu, Nathan Burow, Saurabh Bagchi, Robin Berthier (University of Illinois at Urbana-Champaign), and Rakesh B. Bobba (University of Illinois at Urbana-Champaign). Accepted to appear at the 5th IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 1-6, November 3-6, 2014. (Acceptance rate: 168/398 = 42.2%; in the Security and Privacy track, 41%)
[ Presentation ] [ Abstract ]
The growing reliance on cyber technology in the Smart Grid and other Cyber-Physical Systems (CPS) applications heightens high-assurance challenges. Advanced Metering Infrastructure (AMI) is a critical part of the smart grid, and failure to address security risks in it might disrupt utilities in their mission to deliver power reliably. The objective of this paper is to find mitigation actions for attacks that might be launched against AMI. The paper introduces a tool called SecAMI that calculates the relationship between the speed at which an attack spreads, the speed at which it can be detected, and the consequences on the availability of the AMI. The results from SecAMI can be used as a performance metric to significantly improve the development and the deployment of Intrusion Detection Systems (IDSs) in an AMI environment. Experimental results with an IDS called Amilyzer show that a centralized IDS might not work as efficiently as a distributed IDS in large systems in terms of detection time, especially under a Distributed Denial of Service (DDoS) attack or remote disconnects on the AMI.
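The trade-off SecAMI quantifies can be made concrete with a back-of-the-envelope model; the linear spread rate and detection times below are illustrative assumptions, not SecAMI's model.
```python
# Toy model: how many meters an attack reaches before detection halts it.
def meters_compromised(spread_per_hour, detect_hours, total_meters):
    reached = spread_per_hour * detect_hours
    return min(reached, total_meters)

for detect in (1, 6, 24):  # faster detection -> smaller outage
    n = meters_compromised(spread_per_hour=500, detect_hours=detect,
                           total_meters=100_000)
    print(f"detected after {detect:2d}h: {n:,.0f} meters affected")
```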
-
" Orion: Scaling Genomic Sequence Matching with
Fine-Grained Parallelization,"
Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. Accepted to appear at the International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing), pp. 1-11, Nov 16-21, 2014. (Acceptance rate: 82/394 = 20.8%)
[ Presentation ] [ Abstract ]
Gene sequencing instruments are producing huge volumes of data, straining the capabilities of current database searching algorithms and hindering efforts of researchers analyzing larger collections of data to obtain greater insights. In the space of parallel genomic sequence search, most popular software tools, such as mpiBLAST, use the database segmentation approach, wherein the entire database is sharded and searched on different nodes. However, this approach does not scale well with the increasing length of individual query sequences or the rapid growth in size of sequence databases. In this paper, we propose a fine-grained parallelism technique, called Orion, that divides the input query into an adaptive number of fragments and shards the database. Our technique achieves higher parallelism (and hence speedup) and load balancing, while maintaining 100% accuracy. We show that it is 12.3X faster than mpiBLAST for solving a relevant comparative genomics problem.
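A simplified sketch of the fine-grained decomposition follows; the fixed fragment length and overlap are naive assumptions, whereas Orion chooses the number of fragments adaptively.
```python
# Toy query decomposition: split a long query into overlapping fragments so
# each (fragment, database shard) pair can be aligned independently.
def fragment_query(query, frag_len=100, overlap=20):
    """Yield overlapping query fragments; overlap avoids losing matches that
    straddle a fragment boundary."""
    step = frag_len - overlap
    return [query[i:i + frag_len]
            for i in range(0, max(len(query) - overlap, 1), step)]

query = "ACGT" * 120  # a 480-base query
frags = fragment_query(query)
print(len(frags), "fragments; each can be searched against every DB shard in parallel")
```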
-
" Is Your Web Server Suffering from Undue Stress due to Duplicate Requests?,"
Fahad A. Arshad, Amiya K. Maji, Sidharth Mudgal, and Saurabh Bagchi. As a short paper, at the 11th International Conference on Autonomic Computing (ICAC), pp. 105-111, June 18-20, 2014, Philadelphia, PA. (Acceptance rate: (12 full papers + 10 short papers)/53 = 41.5%)
[ Presentation ] [ Abstract ]
An important, if not very well known, problem that afflicts many web servers is duplicate client browser requests due to server-side problems. A legitimate request is followed by a redundant request, thus increasing the load on the server and corrupting state at the server end (such as the hit count for a page) and at the client end (such as state maintained through a cookie). This problem has been reported in many developer blogs and has been found to afflict even popular web sites, such as CNN and YouTube. However, to date, there has not been a scientific, technical solution to this problem that is browser vendor neutral. In this paper, we provide such a solution, which we call GRIFFIN. We identify that the two root causes of the problem are a missing resource at the server end or duplicated Javascripts embedded in the page. We have the insight that dynamic tracing of the function call sequence creates a signature that can be used to differentiate between legitimate and duplicate requests. We apply our technique to find unreported problems in a large production scientific collaboration web service called HUBzero, which were fixed once we reported them. Our experiments show an average overhead of 1.29X for tracing the PHP-runtime on HUBzero across 60 unique HTTP transactions. GRIFFIN has zero false-positives (when run across HTTP transactions of size one and two) and an average detection accuracy of 78% across 60 HTTP transactions.
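A minimal sketch of the signature idea follows (not GRIFFIN's implementation): hash the dynamic function-call sequence of each request and compare fingerprints of back-to-back requests. The traces shown are invented.
```python
# Toy call-sequence fingerprinting for duplicate-request detection.
import hashlib

def signature(call_trace):
    """Collapse a function-call sequence into a comparable fingerprint."""
    return hashlib.sha1("->".join(call_trace).encode()).hexdigest()

def is_duplicate(prev_trace, cur_trace):
    return signature(prev_trace) == signature(cur_trace)

legit = ["index.php", "render_page", "update_hit_count"]
dup   = ["index.php", "render_page", "update_hit_count"]  # replayed request
print(is_duplicate(legit, dup))  # True -> suppress the redundant side effects
```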
-
" Diagnosis of Performance
Faults in Large Scale MPI Applications via Probabilistic
Progress-Dependence Inference," Ignacio Laguna (LLNL), Dong Ahn (LLNL), Bronis de Supinski (LLNL),
Saurabh Bagchi, and Todd Gamblin (LLNL), Accepted to appear in IEEE Transactions
on Parallel and Distributed Systems (TPDS), pp. 1-15, notification of
acceptance: March 2014. [ Presentation ] [ Abstract ]
Debugging large-scale parallel applications is challenging. Most existing techniques provide little information about failure root causes. Further, most debuggers significantly slow down program execution, and run sluggishly with massively parallel applications. This paper presents a novel technique that scalably infers the tasks in a parallel program on which a failure occurred, as well as the code in which it originated. Our technique combines scalable runtime analysis with static analysis to determine the least-progressed task(s) and to identify the code lines at which the failure arose. We present a novel algorithm that probabilistically infers progress dependence among MPI tasks using a globally constructed Markov model that represents tasks' control-flow behavior. In comparison to previous work, our algorithm infers the least-progressed task more precisely. We combine this technique with static backward slicing analysis, further isolating the code responsible for the current state. A blind study demonstrates that our technique isolates the root cause of a concurrency bug in a molecular dynamics simulation, which only manifests itself at 7,996 tasks or more. We extensively evaluate fault coverage of our technique via fault injections in ten HPC benchmarks and show that our analysis takes less than a few seconds on thousands of parallel tasks.
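A heavily simplified illustration of least-progressed-task inference follows; the real technique builds the Markov model from observed MPI control flow at scale, while this toy assumes a known linear chain of states.
```python
# Toy progress inference: order control-flow states along a common path and
# pick the task stuck in the earliest state.
transitions = [("init", "compute"), ("compute", "exchange"), ("exchange", "reduce")]
order = {}  # state -> position along the common control-flow path
for i, (src, dst) in enumerate(transitions):
    order.setdefault(src, i)
    order.setdefault(dst, i + 1)

task_state = {0: "reduce", 1: "exchange", 2: "compute", 3: "reduce"}
least = min(task_state, key=lambda t: order[task_state[t]])
print(f"least-progressed task: {least} (state {task_state[least]!r})")
# Debugging effort can then focus on task 2 and the code around 'compute'.
```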
-
“ pSigene: Webcrawling to Generalize SQL Injection Signatures,” Gaspar Modelo-Howard, Fahad A. Arshad, Christopher Gutierrez, Saurabh Bagchi, and Alan Qi. At the 44th Annual IEEE/IFIP International Symposium on Dependable Systems and Networks (DSN), pp. 45-56, June 23 - 26, 2014 (Acceptance rate: 56/185 = 30.3%)
Intrusion detection systems (IDS) are an important
component to effectively protect computer systems. Misuse
detection is the most popular approach to detect intrusions,
using a library of signatures to find attacks. The accuracy of
the signatures is paramount for an effective IDS, still today’s
practitioners rely on manual techniques to improve and update
those signatures. We present a system, called pSigene, for the
automatic generation of intrusion signatures by mining the vast
amount of public data available on attacks. It follows a four-step
process to generate the signatures, by first crawling attack
samples from multiple public cybersecurity web portals. Then,
a feature set is created from existing detection signatures to
model the samples, which are then grouped using a biclustering
algorithm which also gives the distinctive features of each
cluster. Finally, the system automatically creates a set of
signatures using regular expressions, one for each cluster. We
tested our architecture for SQL injection attacks and found
our signatures to have true and false positive rates of
90.52% and 0.03%, respectively, and compared our findings
to other SQL injection signature sets from popular IDS and
web application firewalls. Results show our system to be very
competitive to existing signature sets.
[ Abstract ] [ Presentation ]
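The four-step pipeline lends itself to a compact illustration. The sketch below is a toy stand-in that substitutes keyword grouping for pSigene's biclustering and feature modeling; the sample strings and tokens are invented.
```python
# Toy pipeline: crawl(ed) samples -> featurize -> cluster -> regex per cluster.
import re
from collections import defaultdict

samples = [
    "1' OR '1'='1",
    "1' OR 'a'='a",
    "UNION SELECT user, pass FROM accounts",
    "UNION SELECT name, pw FROM users",
]

def feature(s):  # crude "cluster key": which attack keywords appear
    return tuple(k for k in ("OR", "UNION SELECT") if k in s.upper())

clusters = defaultdict(list)
for s in samples:
    clusters[feature(s)].append(s)

for key, members in clusters.items():
    # Emit one permissive signature per cluster from its shared keywords.
    sig = ".*".join(re.escape(k) for k in key)
    print(f"signature /{sig}/i  covers {len(members)} samples")
```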
-
" Accurate Application Progress Analysis for Large-Scale Parallel Debugging,"
Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, and Todd Gamblin. At the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 193-203, Edinburgh, UK, June 9-11, 2014. (Acceptance rate:
52/287 = 18.1%) [ Abstract ]
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can propagate quickly to other tasks, making it difficult to debug. Fortunately, finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches to detecting them suffer from low accuracy and large overheads; either they use imprecise static analysis or they are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, called PRODOMETER, which determines relative progress among parallel tasks using only dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped diagnose a perplexing error in MPI, which only manifested at large scale.
[ Presentation ]
2013
-
“ Characterizing Configuration Problems in Java EE
Application Servers: An Empirical Study with GlassFish and JBoss,”
Fahad A. Arshad, Rebecca J. Krause, and Saurabh Bagchi, At the 24th IEEE International Symposium on Software Reliability Engineering (ISSRE), pp. 1-10, Pasadena, CA, November 4-7, 2013. (Acceptance rate: 46/131 = 35.1%)[ Abstract ]
We present a characterization study on configuration
problems for Java EE application servers. Our study
analyzes a total of 281 bug-reports in two phases: a longer (phase-
1) and a shorter (phase-2) phase, from bug tracking systems of
two popular open source servers, GlassFish and JBoss. We study
configuration problems in four orthogonal dimensions: problem-type,
problem-time, problem-manifestation and problem-culprit. A
configuration problem, by type, is classified as a parameter,
compatibility or a missing-component problem. Problem-time is
classified as pre-boot-time, boot-time or run-time. A configuration
problem manifestation is either silent or non-silent. Problem-culprit
is either the user or the developer of the application
server. Our analysis shows that more than one-third of all
problems in each server are configuration problems. Among all
configuration problems for each server in phase-1, at least 50% of
problems are parameter-based and occur at run-time. In phase-
2, which focuses on specific versions over a shorter time-period,
all three problem types (parameter, compatibility, and missing-component)
have an almost equal share. Further, on average 89%
of configuration problems result in a non-silent manifestation,
while 91% of them are due to mistakes by the developer and
require code-modification to fix the problem. Finally, we test
the robustness to configuration by injecting configuration-bugs
at boot-time with SPECjEnterprise2010 application deployed in
each server. JBoss performs better than GlassFish with all of the
injections giving a non-silent manifestation as opposed to only
65% non-silent manifestations in GlassFish.
[ Presentation ]
-
" Using Big Data for Improving Dependability: A Cellular Network Tale," Nawanol Theera-Ampornpunt, Saurabh Bagchi; Kaustubh Joshi, Rajesh Panta (AT&T Labs). At the 9th Workshop on Hot Topics in Dependable Systems (HotDep), held in conjunction with the 24th ACM Symposium on Operating Systems Principles (SOSP), pp. 1-6, November 3, 2013, Nemacolin Woodlands Resort, PA. (Acceptance rate: 11/21 = 52.3%) [ Abstract ]
There are many large infrastructures that instrument everything
from network performance metrics to user activities.
However, the collected data are generally used for
long-term planning instead of improving reliability and
user experience in real time. In this paper, we present
our vision of how such collections of data can be used
in real time to enhance the dependability of cellular network
services. We first discuss mitigation mechanisms
that can be used to improve reliability but incur a high
cost that prohibits their use except under certain conditions.
We present two case studies where analyses of
real cellular network traffic data show that we can identify
these conditions.
[ Presentation ]
-
“ Automatic Problem Localization in Distributed Applications via Multi-dimensional Metric Profiling,”
Ignacio Laguna, Subrata Mitra, Fahad A. Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P. Midkiff, Mike Kistler (IBM Research), and Ahmed Gheith (IBM Research), At the 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 121-132, Braga, Portugal, September 30-October 3, 2013. (Acceptance rate: 22/67 = 32.8%) [ Presentation ] [ Abstract ]
Debugging today’s large-scale distributed applications
is complex. Traditional debugging techniques such as
breakpoint-based debugging and performance profiling require
a substantial amount of domain knowledge and do not automate
the process of locating bugs and performance anomalies.
We present ORION, a framework to automate the problem-localization
process in distributed applications. From a large set
of metrics, ORION intelligently chooses important metrics and
models the application’s runtime behavior through pairwise
correlations of those metrics in the system, within multiple non-overlapping
time windows. When correlations deviate from those
of a learned correct model due to a bug, our analysis pinpoints
the metrics and code regions (class and method within it) that
are most likely associated with the failure. We demonstrate our
framework with several real-world failure cases in distributed
applications such as HBase, Hadoop DFS, a Purdue campus-wide
Java application for checking availability of lab machines,
and a regression testing framework from IBM. Our results show
that ORION is able to pinpoint the metrics and code regions that
developers need to concentrate on to fix the failures.
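A condensed sketch of ORION's core signal follows: pairwise metric correlations per time window, compared against a baseline learned from correct runs. The two-metric setup, data, and deviation threshold are illustrative assumptions (requires Python 3.10+ for statistics.correlation).
```python
# Toy windowed-correlation anomaly detection over two metrics.
from statistics import correlation  # Python 3.10+

def windowed_corr(xs, ys, win):
    return [correlation(xs[i:i + win], ys[i:i + win])
            for i in range(0, len(xs) - win + 1, win)]

# Learned from a known-correct run: CPU and network normally move together.
baseline = 0.9
cpu = [10, 20, 30, 40, 50, 60, 35, 30, 25, 20, 15, 10]
net = [11, 19, 31, 42, 49, 61, 60, 62, 61, 63, 60, 62]  # decouples mid-run

for w, r in enumerate(windowed_corr(cpu, net, win=6)):
    if abs(r - baseline) > 0.5:
        print(f"window {w}: correlation {r:+.2f} deviates -> inspect code active here")
```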
-
" WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales," Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi. At the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), pp. 131-142, New York City, NY, June 17-21, 2013. (Acceptance rate: 20/131 = 15.3%) [ Presentation ] [ Abstract ]
A key challenge in developing large scale applications is
finding bugs that are latent at the small scales of testing,
but manifest themselves when the application is deployed
at a large scale. Here, we ascribe a dual meaning to "large
scale": it could mean a large number of executing processes
or applications ingesting large amounts of input data (or
both). Traditional statistical techniques fail to detect or diagnose
such kinds of bugs because no error-free run is available
at the large deployment scales for training purposes.
Prior work used scaling models to detect anomalous behavior
at large scales without training on correct behavior at
that scale. However, that work cannot localize bugs automatically,
i.e., cannot pinpoint the region of code responsible
for the error. In this paper, we resolve that shortcoming
by making the following three contributions: (i) we develop
an automatic diagnosis technique, based on feature reconstruction;
(ii) we design a heuristic to effectively prune the
large feature space; and (iii) we demonstrate that our system
scales well, in terms of both accuracy and overhead. We validate
our design through a large-scale fault-injection study
and two case-studies of real-world bugs, finding that our
system can effectively localize bugs in 92.5% of the cases,
dramatically easing the challenge of finding bugs in
large-scale programs.
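A small sketch in the spirit of scale-extrapolated localization follows: fit each program feature against scale on small training runs, predict its value at deployment scale, and rank features by reconstruction error. The linear model and the feature data are illustrative assumptions.
```python
# Toy feature-reconstruction localization across scales.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return lambda x: my + b * (x - mx)

scales = [4, 8, 16, 32]                      # training scales (process counts)
features = {
    "loop_A_iters":   ([40, 80, 160, 320], 10_240),  # (training obs, obs at 1024)
    "branch_B_taken": ([12, 24, 48, 96],   9_999),   # wildly off its trend
}
for name, (ys, observed) in features.items():
    predicted = fit_line(scales, ys)(1024)
    err = abs(observed - predicted) / predicted
    print(f"{name}: predicted {predicted:.0f}, saw {observed}, error {err:.0%}")
# The feature with the largest reconstruction error points to the buggy region.
```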
-
" Lilliput meets Brobdingnagian: Data Center Systems Management through
Mobile Devices," Saurabh Bagchi, Fahad Arshad, Jan Rellermeyer (IBM Research), Thomas Osiecki (IBM Research), Michael Kistler (IBM Research), and Ahmed Gheith (IBM Research). At the 3rd International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV 2013), held with the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-6, June 24, 2013, Budapest, Hungary. [ Presentation ] [ Abstract ]
In this paper, we put forward the notion that systems
management for large masses of virtual machines in data
centers is going to be done differently in the short to
medium term future—through smart phones and through
controlled crowdsourcing to a variety of experts within an
organization, rather than dedicated system administrators
alone. We lay out the research and practitioner challenges
this model raises and give some preliminary solution
directions that are being developed, here at IBM and
elsewhere.
-
" Low-Complexity Secure Protocols to Defend Cyber-Physical Systems Against Network Isolation Attacks," Dong-Hoon Shin, Jinkyu Koo, Lei Yang (Arizona State U.), Xiaojun Lin, Saurabh Bagchi, and Junshan Zhang (Arizona State U.). At the 1st IEEE Conference on Communications and Network Security (CNS), pp. 1-9, Washington DC, October 14-16, 2013. (Acceptance rate: 40/141 = 28.4%) [ Abstract ]
This paper studies the network isolation attack, a
devastating type of attack on cyber-physical systems. In this
attack, an adversary compromises a set of nodes that enclose a
region in order to isolate the region from the rest of the network.
Assuming that the compromised nodes wish to remain undetected, we
propose a solution to defend against the network isolation attack.
Our goal is to achieve the following security guarantee: either
a legitimate node can successfully deliver a message to another
legitimate node, or the network control center can identify a
small set of suspect nodes, which are guaranteed to contain
a compromised node. Toward achieving this goal, we develop
two protocols: one is for secure delivery of messages among
nodes and the other is for secure collection of messages from
nodes at the network control center. We show that our proposed
protocols are provably secure, i.e., attain the aforementioned
security guarantee. Further, our protocols achieve this guarantee
with overhead that is orders-of-magnitude smaller than existing
baseline protocols. Our proposed protocols are thus scalable for
large networks.
-
" A Delay-Bounded Event-Monitoring and Adversary-Identification Protocol in Resource-Constrained Sensor Networks," Jinkyu Koo, Dong-Hoon Shin, Xiaojun Lin, and Saurabh Bagchi. Elsevier Ad Hoc Networks, pp. 1820-1835, vol. 11, issue 6, August 2013. [ Abstract ]
Event monitoring is a common application in wireless sensor networks. For event monitoring,
a number of sensor nodes are deployed to monitor certain phenomenon. When an
event is detected, the sensor nodes report it to a base station (BS), where a network operator
can take appropriate action based on the event report. In this paper, we are interested
in scenarios where the event must be reported within a time bound to the BS possibly over
multiple hops. However, such event reports can be hampered by compromised nodes in
the middle that drop, modify, or delay the event report.
To defend against such an attack, we propose SEM, a Secure Event Monitoring protocol
against arbitrary malicious attacks by Byzantine adversary nodes. SEM provides the following
provable security guarantees. As long as the compromised nodes want to stay undetected,
a legitimate sensor node can report an event to the BS within a bounded time. If
the compromised nodes prevent the event from being reported to the BS within the
bounded time, the BS can identify a pair of nodes that is guaranteed
to contain at least one compromised node. To the best of our
knowledge, no prior work in the literature can provide such guarantees.
SEM is designed to use the minimum level of asymmetric cryptography during normal operation when there
is no attack, and to use cryptographic primitives more liberally when an attack
is detected. This design has the advantage that the overall SEM protocol is lightweight in terms
of the computational resources and the network traffic required by the cryptographic operations.
We also show an operational example of SEM using TOSSIM simulations.
-
" An Optimization Framework for Monitoring Multi-Channel Multi-Radio Wireless Mesh Networks," Dong-Hoon Shin and Saurabh Bagchi. Elsevier Ad Hoc Networks, pp. 926-943, Volume 11, Issue 3, May 2013. [ Abstract ]
This paper studies an optimal monitoring problem for behavior-based detection in multi-channel
multi-radio wireless mesh networks. In behavior-based detection, nodes overhear
communications in their neighborhood to determine if the behaviors of their neighbors are
legitimate. The objective of this work is to maximize the number of nodes being monitored
by judiciously choosing a set of monitoring nodes and also channels for the chosen monitoring
nodes. This problem is NP-hard, growing exponentially with the number of monitoring
nodes. We develop three approximation algorithms, each of which achieves at least a
constant factor of the optimum. Furthermore, one of our algorithms achieves the best possible
approximation ratio among all polynomial-time algorithms, unless P = NP. We conduct
simulations in random networks and scale-free networks to evaluate the coverage
and the execution time of the three algorithms.
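Constant-factor guarantees for this class of coverage problems typically come from greedy max-coverage arguments; the sketch below shows that greedy skeleton on a toy topology and is not a reproduction of the paper's three algorithms.
```python
# Toy greedy max-coverage for choosing (monitor node, channel) pairs.
def greedy_monitor_assignment(candidates, budget):
    """candidates: {(node, channel): set of neighbors covered}."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(candidates, key=lambda c: len(candidates[c] - covered))
        if not candidates[best] - covered:
            break  # nothing new to cover
        chosen.append(best)
        covered |= candidates[best]
    return chosen, covered

candidates = {
    ("m1", 1): {"a", "b"}, ("m1", 6): {"c"},
    ("m2", 1): {"b", "c"}, ("m2", 6): {"d", "e"},
}
chosen, covered = greedy_monitor_assignment(candidates, budget=2)
print(chosen, "cover", sorted(covered))
```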
-
" Toward Optimal Sniffer-Channel Assignment for Reliable Monitoring in Multi-Channel Wireless Networks," Dong-Hoon Shin, Saurabh Bagchi, and Chih-Chun Wang. At the 10th IEEE International Conference on Sensing, Communication, and Networking (SECON), pp. 1-9, New Orleans, LA, June 24-27, 2013. (Acceptance rate: 51/173 = 29.5%) [ Presentation ] [ Abstract ]
This paper studies the optimal sniffer-channel assignment
for reliable monitoring in multi-channel wireless networks.
This problem concerns how to deploy certain sniffers in
a network (and tune their channels) so that they can overhear
and verify communication among the other nodes, referred
to as normal nodes. Prior works have studied the optimal
sniffer-channel assignment, but they assume perfect sniffers.
However, in practice, sniffers may probabilistically make errors
in monitoring, e.g., due to poor reception and compromise by an
adversary. Hence, to maintain acceptable monitoring quality, a
node needs to be overheard by multiple sniffers. We show that
the optimal sniffer-channel assignment with sniffer redundancy
differs fundamentally from the previous works due to the absence
of a desirable property called submodularity. As a result, in
our problem, the prior approximation algorithms no longer
maintain their performance guarantees. We propose a variety
of approximation algorithms based on two approaches—greedy
strategy and relaxation-and-rounding approach. We present
an empirical performance analysis of the proposed algorithms
through simulations in practical networks. Our results suggest
that our two algorithms show a performance trade-off between
coverage and running time and are therefore suitable for different
kinds of deployment.
-
" Mitigating the Effects of Software Component Shifts for Incremental Reprogramming of Wireless Sensor Networks," Rajesh Krishna Panta and Saurabh Bagchi. IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1882-1894, vol. 23, issue 10, October 2012. [ Abstract ]
Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks because software
functionality needs to be changed over time. The amount of information that needs to be wirelessly transmitted during reprogramming
should be minimized to reduce reprogramming time and energy. In this paper, we present a multihop incremental reprogramming
system called Hermes that transfers over the network the delta between the old and new software and lets the sensor nodes rebuild
the new software using the received delta and the old software. It reduces the delta by using techniques to mitigate the effects of
function and global variable shifts caused by the software modifications. Then it compares the binary images at the byte level with a
method to create a small delta that needs to be sent over the wireless network to all the nodes. For the wide range of software change
scenarios that we experimented with, we find that Hermes transfers up to 201 times less information than Deluge, the standard
reprogramming system for TinyOS, and 64 times less than an existing incremental reprogramming system by Jeong and Culler.
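The byte-level delta step can be illustrated with copy/insert opcodes; the sketch below uses Python's difflib as a stand-in differ, whereas Hermes adds shift-mitigation passes before diffing and uses its own encoding.
```python
# Toy byte-level delta: ship only changed bytes, reuse bytes already on node.
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=old, b=new,
                                               autojunk=False).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse on-node bytes
        else:
            ops.append(("insert", new[j1:j2]))     # ship only changed bytes
    return ops

def apply_delta(old: bytes, ops):
    out = b""
    for op in ops:
        out += old[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return out

old = b"\x01\x02\x03\x04" * 50
new = old[:120] + b"\xff\xfe" + old[120:]
ops = make_delta(old, new)
assert apply_delta(old, ops) == new
print(sum(len(o[1]) for o in ops if o[0] == "insert"), "bytes shipped vs", len(new))
```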
-
" Distributed mobility management for efficient video delivery over all-IP mobile networks: Competing approaches," Dong-Hoon Shin, Danny Moses (Intel), Muthaiah Venkatachalam (Intel), and Saurabh Bagchi. IEEE Network magazine, vol. 27, no. 2, pp. 28-33, March-April 2013. [ Abstract ]
The recent proliferation of multimedia mobile devices and a variety of mobile
applications are generating an enormous amount of data traffic over mobile networks.
The key driver of the mobile traffic growth is mobile video. Currently,
mobile networks are evolving to the 4G system, which has a flatter architecture
and provides all-IP-based mobile broadband service. In all-IP mobile networks, IP
mobility management is a key function that allows mobile nodes to continue their
communications even when their point of attachment to the IP network changes.
Existing mobile networks employ a centralized mobility management scheme where
all intelligence is concentrated in one end-point system, rather than being distributed
through the internet. However, this cannot satisfactorily support mobile videos,
which demand a large volume of data and often require QoS such as session continuity
and low delay. This motivates distributed mobility management (DMM) solutions
that can efficiently handle mobile video traffic. In this article, we survey
different approaches for DMM in standards development organizations such as
IETF and 3GPP, and also in research organizations. We focus on three different
DMM approaches that are currently being considered by the IETF: PMIPv6-based,
MIPv6-based, and routing-based DMMs. We provide a qualitative analysis to compare
the three DMM approaches and discuss which DMM approaches are more
suitable for efficient mobile video delivery.
2012
-
" ABHRANTA: Locating Bugs that Manifest at Large System Scales,"
Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. At the 8th Workshop on Hot Topics in System Dependability (HotDep) (co-located with OSDI '12), pp. 1-6, Hollywood, CA, October 7, 2012. (Acceptance rate:
10/24 = 41.7%) [ Presentation ] [ Abstract ]
A key challenge in developing large scale applications
(both in system size and in input size) is finding bugs that
are latent at the small scales of testing, only manifesting
when a program is deployed at large scales. Traditional
statistical techniques fail because no error-free run
is available at deployment scales for training purposes.
Prior work used scaling models to detect anomalous behavior
at large scales without being trained on correct
behavior at that scale. However, that work cannot localize
bugs automatically. In this paper, we extend that
work with automatic diagnosis technique, based on feature
reconstruction, and validate our design through case
studies with two real bugs from an MPI library and a
DHT-based file sharing application.
-
“ mcrEngine: A Scalable
Checkpointing System using Data-Aware Aggregation and Compression,”
Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis
R. de Supinski, and Rudolf Eigenmann. At the IEEE/ACM International Conference for High
Performance Computing, Networking, Storage, and Analysis
(Supercomputing), pp. 1-10, Salt Lake City, Utah, November 10-16, 2012.
(Acceptance rate: 100/472 = 21.2%) (One of 8 finalists for the best student paper award)
[ Presentation ][ Abstract ]
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure.
We alleviate this problem through a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
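Why data-aware aggregation improves compressibility can be shown with a toy: grouping like variables across process checkpoints keeps repeats within the compressor's 32 KB window. The identical-per-rank data and sizes below are artificial assumptions; MCRENGINE derives its grouping from HDF5/netCDF metadata.
```python
# Toy demo: per-variable grouping compresses better than per-process layout.
import zlib, os

temps = os.urandom(20_000)   # stand-in for one process's float array
steps = os.urandom(20_000)   # stand-in for one process's int array
ranks = 8
# Every rank checkpoints (nearly) the same variables in this toy.
naive = (temps + steps) * ranks                  # concatenate per process
aware = temps * ranks + steps * ranks            # group per variable
print("naive :", len(zlib.compress(naive)))      # repeats sit 40 KB apart,
print("aware :", len(zlib.compress(aware)))      # outside zlib's 32 KB window
```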
-
"To Cloud or Not to Cloud: A Study of Trade-offs between In-house and Outsourced Virtual Private Network," Fahad A. Arshad, Gaspar Modelo-Howard, and Saurabh Bagchi. At the 7th workshop on Secure Network Protocols (NPSec
2012) (co-located with ICNP '12), pp. 1-6, Austin, TX, October 30, 2012. (Acceptance rate not available) [ Abstract ]
The question of whether to migrate IT services
to a cloud computing infrastructure arises before most IT
decision makers today. To enable secure access to sensitive
resources a virtual private network (VPN) is almost a required
piece of technology. Setting up and managing a VPN server
is a non-trivial task—there are a variety of modes in which
VPN can be used (IPSec, SSL/TLS, PPTP), there are a variety
of software-only and software-hardware solutions, and each
comes with a rich set of configuration options. Therefore, which
option to choose is a perplexing question for practitioners,
requiring an understanding of the performance and the security
implications of each choice. In this paper, we consider the
various factors that should go into such decision making and
exemplify this by choosing among two competitive options for
protecting access to IT resources of our NSF center which has a
significant number of external (i.e., non-Purdue) users. The two
options are an open-source software-only VPN (pfSense) and
a commercial appliance, i.e., an integrated hardware-software
solution. Further, the first is managed by us while the latter is
outsourced to an entity that provides VPN services to multiple
consumer organizations, and hence referred to by us as the cloud-based
service. We follow up by conducting a post-deployment
study of the VPN users which reveals that despite a two-fold
reduction in throughput, the cloud-based service is considered
satisfactory due to its non-intrusiveness with respect to other
network activities and ease of configuration.
" Multi-Armed Bandit Congestion Control
in Multi-Hop Infrastructure Wireless Mesh Networks,” A. B. M. Alim Al Islam, S. M. Iftekharul Alam, Vijay
Raghunathan, and Saurabh Bagchi. At the 20th IEEE International Symposium on Modeling, Analysis and
Simulation of Computer and Telecommunication Systems (MASCOTS), pp.
31-40, August 7-9, 2012, Arlington, VA. (Acceptance rate: 49/136 = 36%,
our paper was one of 7 accepted papers selected for a 10-page publication
limit; the others had an 8-page limit) [ Abstract ]
Congestion control in multi-hop infrastructure wireless mesh networks is both an important and a unique problem. It is unique because it has two prominent causes of failed transmissions which are difficult to tease apart: the lossy nature of the wireless medium and the high extent of congestion around gateways in the network. The concurrent presence of these two causes limits the applicability of already available congestion control mechanisms proposed for wireless networks. Prior mechanisms mainly focus on the former cause, ignoring the latter one. Therefore, we address this issue to design an end-to-end congestion control mechanism for infrastructure wireless mesh networks in this paper. We formulate the congestion control problem and map it to the restless multi-armed bandit problem, a well-known decision problem in the literature. Then, we propose three myopic policies to achieve a near-optimal solution for the mapped problem, since no optimal solution is known to this problem. We perform comparative evaluation through ns-2 simulation and a real testbed experiment with a wireline TCP variant and a wireless TCP protocol. The evaluation reveals that our proposed mechanism can achieve up to 52% increased network throughput and 34% decreased average energy consumption per transmitted bit in comparison to the other end-to-end congestion control variants.
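A toy myopic policy in the bandit spirit of the paper follows: each candidate sending rate is an arm, the reward is delivered throughput, and the sender mostly plays the best-looking arm. The rates, loss model, and exploration probability are illustrative assumptions.
```python
# Toy epsilon-greedy (myopic) rate selection as a multi-armed bandit.
import random

arms = [0.5, 1.0, 2.0, 4.0]          # candidate sending rates (Mbps)
est, plays = [0.0] * len(arms), [0] * len(arms)

def reward(rate):
    loss = 0.05 if rate <= 2.0 else 0.7   # heavy loss once the gateway saturates
    return rate * (1 - loss) * random.uniform(0.9, 1.1)

random.seed(1)
for t in range(500):
    i = random.randrange(len(arms)) if random.random() < 0.1 \
        else max(range(len(arms)), key=lambda j: est[j])
    r = reward(arms[i])
    plays[i] += 1
    est[i] += (r - est[i]) / plays[i]   # incremental mean of delivered throughput
print({f"{a} Mbps": round(e, 2) for a, e in zip(arms, est)})
```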
-
“ Privatus: Wallet-Friendly Privacy Protection for Smart Meters,” Jinkyu Koo, Xiaojun Lin, and Saurabh Bagchi. At the 17th European Symposium on Research in Computer Security
(ESORICS), pp. 1-18, September 10-12, 2012, Pisa, Italy. (Acceptance
rate: 50/248 = 20.2%) [ Presentation ] [ Abstract ]
-
“ Probabilistic Diagnosis of Performance Faults in Large Scale Parallel Applications,” Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski,
Saurabh Bagchi, and Todd Gamblin.
At the 21st International Conference on Parallel
Architectures and Compilation Techniques (PACT), pp. 1-10, September
19-23, 2012, Minneapolis, MN. (Acceptance rate: 39/207 = 18.8%) [ Presentation ] [ Abstract ]
Debugging large-scale parallel applications is challenging. Most existing techniques provide mechanisms for process control, but they provide little information about the causes of failures. Most debuggers also scale poorly despite continued growth in supercomputer core counts. We have developed a novel, highly scalable tool to help developers understand and fix correctness problems at scale. Our tool probabilistically infers the least progressed task in MPI programs using Markov models of execution history and dependence analysis. This analysis guides program slicing to find code that may have caused a failure. In a blind study, we demonstrate that our tool can isolate the root cause of a particularly perplexing bug encountered at scale in a molecular dynamics simulation. Further, we perform fault injections into two benchmark codes and measure the scalability of the tool. Our results show that it accurately detects the least progressed task in most cases and can perform the diagnosis in a fraction of a second with thousands of tasks.
- "Automatic Fault Characterization via Abnormality-Enhanced Classification," Greg Bronevetsky (LLNL), Ignacio Laguna, Saurabh Bagchi and Bronis R. de Supinski (LLNL). In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-12, Boston, MA, June 25-28, 2012 (Acceptance rate: 51/236 = 21.6%) [ Presentation ] [ Abstract ]
Enterprise and high-performance computing systems are growing extremely large and complex, employing many processors and diverse software/hardware stacks [1]. As these machines grow in scale, faults become more frequent and system complexity makes it difficult to detect and diagnose them. The difficulty is particularly large for faults that degrade system performance or cause erratic behavior but do not cause outright crashes. The cost of these errors is high since they significantly reduce system productivity, both initially and by time required to resolve them. Current system management techniques [2], [3] do not work well since they require manual examination of system behavior and do not identify root causes.
When a fault is manifested, system administrators need timely notification about the type of fault, the time period in which it occurred and the processor on which it originated. Statistical modeling approaches can accurately characterize normal and abnormal system behavior [4]. However, the complex effects of system faults are less amenable to these techniques. This paper demonstrates that the complexity of system faults makes traditional classification and clustering algorithms inadequate for characterizing them. We design novel techniques that combine classification algorithms with information on the abnormality of application behavior to improve detection and characterization accuracy significantly. Our experiments demonstrate that our techniques can detect and characterize faults with 85% accuracy, compared to just 12% accuracy for direct applications of traditional techniques.
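The paper's key move, feeding abnormality information into classification, can be sketched briefly; the z-score features, nearest-centroid classifier, and data below are illustrative assumptions.
```python
# Toy abnormality-enhanced classification: classify in z-score space.
def zscores(sample, mean, std):
    return [(x - m) / s for x, m, s in zip(sample, mean, std)]

# Fault-free training statistics for (cpu, mem) metrics.
mean, std = [50.0, 30.0], [5.0, 4.0]

# Labeled fault signatures in *abnormality* space (sigmas from normal).
centroids = {"cpu-hog": [6.0, 0.5], "mem-leak": [0.5, 7.0], "normal": [0.0, 0.0]}

def classify(sample):
    z = zscores(sample, mean, std)
    return min(centroids, key=lambda c:
               sum((a - b) ** 2 for a, b in zip(z, centroids[c])))

print(classify([52, 58]))  # mem far above training mean -> 'mem-leak'
```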
-
" A Study of Soft Error Consequences in Hard Disk Drives," Timothy Tsai (Hitachi GST), Nawanol Theera-Ampornpunt and Saurabh Bagchi. In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (Practical Experience Report), pp. 1-8, Boston, MA, June 25-28, 2012 (Acceptance rate: 51/236 = 21.6%) [ Presentation ] [ Abstract ]
Hard disk drives have multiple layers of fault tolerance mechanisms that protect against data loss. However, a few failures occasionally breach the entire set of mechanisms. To prevent such scenarios, we rely on failure prediction mechanisms to raise alarms with sufficient warning to allow the at-risk data to be copied to a safe location. A common failure prediction technique monitors the occurrence of soft errors and triggers an alarm when the soft error rate exceeds a specified threshold. This study uses data collected from over 50,000 customer deployed disk drives to evaluate the performance of a failure prediction algorithm that relies solely on soft errors to predict failures manifested as hard errors. The data analysis shows that soft errors alone cannot be used as a reliable predictor of hard errors. However, in those cases where soft errors do accurately predict hard errors, sufficient warning time exists for preventive actions.
-
" An Empirical Study of the Robustness of Inter-component Communication in Android," Amiya K. Maji, Fahad A. Arshad, Saurabh Bagchi and Jan S. Rellermeyer (IBM Research, Austin) In the 42th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-12, Boston, MA, June 25-28, 2012 (Acceptance rate: 51/236 = 21.6%) [ Presentation ] [ Abstract ]
Over the last three years, Android has established itself as the largest-selling operating system for smartphones. It boasts of a Linux-based robust kernel, a modular framework with multiple components in each application, and a security-conscious design where each application is isolated in its own virtual machine. However, all of these desirable properties would be rendered ineffectual if an application were to deliver erroneous messages to targeted applications and thus cause the target to behave incorrectly. In this paper, we present an empirical evaluation of the robustness of Inter-component Communication (ICC) in Android through a fuzz testing methodology, whereby parameters of the inter-component communication are changed to various incorrect values. We show that not only is exception handling a rarity in Android applications, but also that it is possible to crash the Android runtime from unprivileged user processes. Based on our observations, we highlight some of the critical design issues in Android ICC and suggest solutions to alleviate these problems.
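The fuzzing methodology generalizes beyond Android; the sketch below mutates one field of a well-formed message at a time (the actual study delivered malformed Intents on a device). The template fields and mutation values are invented.
```python
# Toy one-field-at-a-time fuzzer for structured messages.
import copy, itertools

template = {"action": "VIEW", "uri": "content://contacts/1", "extras": {"id": 7}}
mutations = [None, "", 0, -1, "A" * 4096, {"unexpected": object()}]

def fuzz_cases(msg):
    for field, bad in itertools.product(msg, mutations):
        variant = copy.deepcopy(msg)
        variant[field] = bad          # one malformed field per test case
        yield field, bad, variant

for field, bad, variant in list(fuzz_cases(template))[:5]:
    print(f"deliver variant with {field}={bad!r} and watch for uncaught exceptions")
```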
-
" Dependability as a Cloud Service - A Modular Approach," Jan S.
Rellermeyer (IBM Research) and Saurabh Bagchi. At the Second International Workshop on Dependability of Clouds, Data Centers and Virtual Machine Technology (DCDV 2012) (co-located with DSN '12), pp.
1-6, Boston, MA, June 25, 2012. (Acceptance rate not available) [ Abstract ]
Failures of services on cloud platforms are only to
be expected. To deal with such failures, one is naturally inclined
to use the traditional measure of replication. However, replication
of services on distributed cloud platforms poses several
challenges that are not well met by today’s Java middleware
systems. These challenges are the need to isolate state in the
application components so that easy migration and recovery
are possible and the requirement for client transparency
when dealing with different replicated service instances. For
example, Java Enterprise Edition (JEE) makes it difficult to
have transparent replication of services due to the above two
reasons plus the fine-grained nature of interactions between
its components (the Enterprise Java Beans). In this paper,
we show parts of the design of OSGi, a specification defining
a dynamic component system in Java, that make it suitable
for the above task. We then propose two extensions to OSGi
which will allow exposing and exporting application component
state and transparent invocation of service instances. These two
together can enable easy replication and recovery from failures
in cloud environments. We show through experiments that our
prototype can migrate a failed service quickly enough to a new
machine so that a client experiences only a moderate increase
in service invocation time during system recovery.
2011
- “ Distributed Online Channel Assignment Toward Optimal Monitoring in Multi-Channel Wireless Networks,” Dong-Hoon Shin, Saurabh Bagchi, and Chih-Chun Wang. At the 31st Annual IEEE International Conference on Computer Communications (INFOCOM) Mini-conference, pp. 2918-2912, 2012. [ Presentation ] [ Abstract ]
This paper studies an optimal channel assignment problem for passive monitoring in multi-channel wireless networks, where a set of sniffers capture and analyze the network traffic to monitor the network. The objective of this problem is to maximize the total amount of traffic captured by sniffers by judiciously assigning the radios of sniffers to a set of channels. This problem is NP-hard, with the computational complexity growing exponentially with the number of sniffers. We develop distributed online solutions to this problem for large-scale and dynamic networks. Prior works have attained a constant factor 1 - 1/e of the maximum monitoring coverage in a centralized setting. Our algorithm preserves the same ratio while providing a distributed solution that is amenable to online implementation. Also, our algorithm is cost-effective, in terms of communication and computational overheads, due to the use of only local communication and the adaptation to incremental network changes. We present two operational modes of our algorithm for two types of networks that have different rates of network changes. One is a proactive mode for fast varying networks, while the other is a reactive mode for slowly varying networks. Simulation results demonstrate the effectiveness of the two modes of our algorithm.
- "Large Scale Debugging of Parallel Tasks with AutomaDeD," Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, and Barry Rountree, At the Supercomputing Conference, 12 pages, Seattle, WA, Nov 12-18, 2011. (Acceptance rate: 74/352 = 21.0%)
[ Presentation ] [ Abstract ]
Developing correct HPC applications continues to be a challenge as the number of cores increases in today’s largest systems. Most existing debugging techniques perform poorly at large scales and do not automatically locate the parts of the parallel application in which the error occurs. The overhead of collecting large amounts of runtime information and an absence of scalable error detection algorithms generally cause poor scalability. In this work, we present novel, highly efficient techniques that facilitate the process of debugging large scale parallel applications. Our approach extends our previous work, AutomaDeD, in three major areas to isolate anomalous tasks in a scalable manner: (i) we efficiently compare elements of graph models (used in AutomaDeD to model parallel tasks) using pre-computed lookup-tables and by pointer comparison; (ii) we compress per-task graph models before the error detection analysis so that comparison between models involves many fewer elements; (iii) we use scalable sampling-based clustering and nearest-neighbor techniques to isolate abnormal tasks when bugs and performance anomalies are manifested. Our evaluation with fault injections shows that AutomaDeD scales well to thousands of tasks and that it can find anomalous tasks in under 5 seconds in an online manner.
- "Secure Configuration of Intrusion Detection Sensors for Changing Enterprise Systems," Gaspar Modelo-Howard, Jevin Sweval, and Saurabh Bagchi, At the 7th International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 20 pages (LNCS format), London, UK, Sep 7-9, 2011. (Acceptance rate: 23/95 = 24.2%)
[ Abstract ] [ Presentation ]
Current attacks on distributed systems involve multiple steps, because attackers usually take multiple actions to achieve their goals. Such attacks are called multi-stage attacks and have the ultimate goal of compromising a critical asset for the victim. An example would be compromising a web server, then achieving a series of intermediary steps (such as compromising a developer's box thanks to a vulnerable PHP module and connecting to an FTP server with gained credentials) to ultimately connect to a database where user credentials are stored. Current detection systems are not capable of analyzing the multi-step attack scenario. In this document we present a distributed detection framework based on a probabilistic reasoning engine that communicates with detection sensors and can achieve two goals: (1) protect the critical asset by detecting multi-stage attacks and (2) tune sensors according to the changing environment of the distributed system monitored by the distributed framework. As shown in the experiments, the framework reduces the number of false positives that it would otherwise report if it were only considering alerts from a single detector, and the reconfiguration of sensors allows the framework to detect attacks that take advantage of the changing system environment.
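The reasoning-engine step can be miniaturized as a Bayesian belief update over attack stages; the stages, priors, and sensor accuracy numbers below are illustrative assumptions, not the paper's trained values.
```python
# Toy multi-stage attack inference: update stage beliefs from sensor alerts.
def update(prior, tpr, fpr, alert=True):
    """P(stage | alert) via Bayes; tpr/fpr are the sensor's detection rates."""
    p_alert = tpr * prior + fpr * (1 - prior)
    return (tpr * prior / p_alert) if alert else \
           ((1 - tpr) * prior / (1 - p_alert))

belief = 0.05                      # prior: web server already compromised
belief = update(belief, tpr=0.8, fpr=0.1, alert=True)   # IDS alert on web tier
belief = update(belief, tpr=0.7, fpr=0.05, alert=True)  # odd FTP login follows
print(f"posterior that the attack path is active: {belief:.2f}")
# Crossing a threshold could trigger tightening the database-tier sensors.
```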
- "v-CAPS: A Confidentiality and Anonymity Preserving Routing Protocol for Content-Based Publish-Subscribe Networks," Amiya Maji and Saurabh Bagchi, At the 7th International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 20 pages (LNCS format), London, UK, Sep 7-9, 2011. (Acceptance rate: 23/95 = 24.2%)
[ Abstract ]
Content-based Publish-Subscribe (CBPS) is a widely used communication paradigm where publishers "publish" messages and a set of subscribers receive these messages based on their interests through filtering and routing by an intermediate set of brokers. CBPS has proven to be suitable for many-to-many communication, offering flexibility and efficiency in communications between a dynamic set of publishers and subscribers. We are interested in using CBPS in healthcare settings to disseminate health-related information (drug interactions, diagnostic information on diseases) to large numbers of subscribers in a confidentiality-preserving manner. Confidentiality in CBPS requires that the message be hidden from brokers, whereas the brokers need the message to compute routing decisions. Previous approaches to achieve these conflicting goals suffer from significant shortcomings – misrouting, lesser expressivity of subscriber interests, high execution time, and high message overhead. Our solution, titled v-CAPS, achieves the competing goals while avoiding the previous problems. In v-CAPS, the trusted publishers extract the routing information based on the message and the brokers keep minimal information needed to perform local routing. The routing information is cryptographically secured so that curious brokers or other subscribers cannot learn about the recipients. Our experiments show that v-CAPS has comparable end-to-end message latency to a baseline insecure CBPS system with unencrypted routing vectors. However, the cost of hiding the routing vectors from the brokers is significantly higher.
[ Presentation ]
- "Dangers and Joys of Stock Trading on the Web: Failure Characterization of a Three-Tier Web Service," Fahad Arshad and Saurabh Bagchi, At the 30th IEEE Symposium on Reliable Distributed Systems (SRDS), 10 pages, Madrid, Spain, Oct 5-7, 2011. (Acceptance rate: 30/88 = 34.1%)
[ Presentation ] [ Abstract ]
Characterizing latent software faults is crucial to address dependability issues of current three-tier systems. A client should not have the misconception that a transaction succeeded when, in reality, it failed due to a silent error. We present a fault injection-based evaluation to characterize silent and non-silent software failures in a representative three-tier web service, one that mimics a day-trading application widely used for benchmarking application servers. For failure characterization, we quantify the distribution of silent and non-silent failures, and recommend low-cost application-generic and application-specific consistency checks, which improve the reliability of the application. We inject three variants of null-call, where a callee returns null to the caller without executing business logic. Additionally, we inject three types of unchecked exceptions and analyze the reaction of our application. Our results show that 49% of error injections from null-calls result in silent failures, while 34% of unchecked exceptions result in silent failures. Our application-generic consistency check can detect silent failures in null-calls with an accuracy as high as 100%. Non-silent failures with unchecked exceptions can be detected with an accuracy of 42% with our application-specific checks.
- "Aveksha: A Hardware-Software Approach for Non-intrusive Tracing and Profiling of Wireless Embedded Systems,": Matthew Tancreti, Mohammad Hossain, Saurabh Bagchi, and Vijay Raghunathan. In: 9th ACM Conference on Embedded Networked Sensor Systems (SenSys), 14 pages, Seattle, WA, Nov 1-4, 2011. (Winner of the best paper award) (Acceptance rate: 24/123 = 19.5%) [ Presentation ]
[ Abstract ]
It is important to get an idea of the events occurring in an embedded wireless node when it is deployed in the field, away from the convenience of an interactive debugger. Such visibility can be useful for post-deployment testing, replay-based debugging, and for performance and energy profiling of various software components. Prior software-based solutions to address this problem have incurred high execution overhead and intrusiveness. The intrusiveness changes the intrinsic timing behavior of the application, thereby reducing the fidelity of the collected profile. Prior hardware-based solutions have involved the use of dedicated ASICs or other tightly coupled changes to the embedded node’s processor, which significantly limits their applicability. In this paper, we present AVEKSHA, a hardware-software approach for achieving the above goals in a non-intrusive manner. Our approach is based on the key insight that most embedded processors have an on-chip debug module (which has traditionally been used for interactive debugging) that provides significant visibility into the internal state of the processor. We design a debug board that interfaces with the on-chip debug module of an embedded node’s processor through the JTAG port and provides three modes of event logging and tracing: breakpoint, watchpoint, and program counter polling. Using expressive triggers that the on-chip debug module supports, AVEKSHA can watch for, and record, a variety of programmable events of interest. A key feature of AVEKSHA is that the target processor does not have to be stopped during event logging (in the last two of the three modes), subject to a limit on the rate at which logged events occur. AVEKSHA also performs power monitoring of the embedded wireless node and, importantly, enables power consumption data to be correlated to events of interest. AVEKSHA is an operating system-agnostic solution. We demonstrate its functionality and performance using three applications running on the TelosB motes; two in TinyOS and one in Contiki. We show that AVEKSHA can trace tasks and other generic events and can perform energy monitoring at function and task-level granularity. We also describe how we used AVEKSHA to find a subtle bug in the TinyOS low power listening protocol.
- “Vrisha: Using Scaling Properties of Parallel Programs for Bug Detection and Localization,” Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi, At the 20th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 12 pages, San Jose, California, June 8-11, 2011. (Acceptance rate: 22/170 = 12.9%) [ Presentation ] [ Abstract ]
Detecting and isolating bugs that arise in parallel programs is a tedious and challenging task. An especially subtle class of bugs are those that are scale-dependent: while small-scale test cases may not exhibit the bug, the bug arises in large-scale production runs, and can change the result or performance of an application. A popular approach to finding bugs is statistical bug detection, where abnormal behavior is detected through comparison with bug-free behavior. Unfortunately, for scale-dependent bugs, there may not be bug-free runs at large scales and therefore traditional statistical techniques are not viable. In this paper, we propose Vrisha, a statistical approach to detecting and localizing scale-dependent bugs. Vrisha detects bugs in large-scale programs by building models of behavior based on bug-free behavior at small scales. These models are constructed using kernel canonical correlation analysis (KCCA) and exploit scale-determined properties, whose values are predictably dependent on application scale. We use Vrisha to detect and diagnose two bugs that appear in popular MPI libraries, and show that our techniques can be implemented with low overhead and low false-positive rates.
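As a rough illustration of the modeling idea (not the paper's implementation), the sketch below fits a linear CCA model, a stand-in for Vrisha's kernel CCA, between run scale and synthetic behavioral features, then flags a large run whose features break the extrapolated trend; the feature names, noise levels, and tolerance are all assumptions.

```python
# Sketch of scale-based bug detection in the spirit of Vrisha: learn how
# behavioral features grow with scale on small bug-free runs, then flag large
# runs that break the extrapolated trend. Linear CCA from scikit-learn stands
# in for the paper's kernel CCA; features and thresholds are invented.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
scales = np.arange(4, 64, 4, dtype=float).reshape(-1, 1)   # training run sizes
features = np.hstack([2.0 * scales, 100.0 * scales])       # e.g., msgs, bytes sent
features += rng.normal(0.0, [0.05, 2.0], features.shape)   # benign run noise

model = CCA(n_components=1).fit(scales, features)

def run_is_anomalous(nprocs, observed, rel_tol=0.2):
    """Compare observed features of a large run against the scaling trend."""
    predicted = model.predict([[float(nprocs)]])[0]
    return bool(np.any(np.abs(observed - predicted) / np.abs(predicted) > rel_tol))

print(run_is_anomalous(1024, np.array([2048.0, 102400.0])))  # on trend -> False
print(run_is_anomalous(1024, np.array([600.0, 102400.0])))   # broken -> True
```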
- “Efficient Incremental Code Update for Sensor Networks,” Rajesh Krishna Panta, Saurabh Bagchi, and Samuel P. Midkiff, ACM Transactions on Sensor Networks, pp. 1-32, Vol. 7, No. 4, February 2011. [ Abstract ]
Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks since software functionality needs to be changed over time. During reprogramming, the number of radio transmissions should be minimized, since reprogramming time and energy depend chiefly on the number of radio transmissions. In this article, we present a multihop incremental reprogramming protocol called Zephyr that transfers the delta between old and new software versions, and lets the sensor nodes rebuild the new software using the received delta and the old software. Zephyr reduces the delta size by using application-level modifications to mitigate the effects of function shifts. Then it compares the two binary images at the byte level to generate a small delta, which is then sent over the wireless network to all the nodes. For the wide range of software change cases that we used as benchmarks, Zephyr transfers 1.83 to 1987 times less traffic through the network than Deluge (the standard nonincremental reprogramming protocol for TinyOS) and 1.14 to 49 times less traffic than an existing incremental reprogramming protocol by Jeong and Culler [2004].
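For intuition about what a byte-level delta looks like, here is a toy encoder and decoder built from COPY and INSERT operations; it uses a naive quadratic match search for clarity, whereas Zephyr's byte-level comparison is engineered to produce small deltas on real program images.

```python
# Toy byte-level delta: COPY(offset, length) reuses bytes from the old image;
# INSERT carries literal new bytes. Illustrative only; not Zephyr's encoding.

def make_delta(old, new, min_match=4):
    delta, i = [], 0
    while i < len(new):
        best_len, best_off = 0, 0
        # Naive search for the longest old-image match starting at new[i]
        # (real tools index the old image; this is O(n*m) for clarity).
        for off in range(len(old)):
            l = 0
            while (i + l < len(new) and off + l < len(old)
                   and old[off + l] == new[i + l]):
                l += 1
            if l > best_len:
                best_len, best_off = l, off
        if best_len >= min_match:
            delta.append(("COPY", best_off, best_len)); i += best_len
        else:
            delta.append(("INSERT", new[i:i + 1])); i += 1
    return delta

def apply_delta(old, delta):
    out = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, off, l = op
            out += old[off:off + l]
        else:
            out += op[1]
    return bytes(out)

old = b"ABCDEFGH12345678"
new = b"ABCDEFGHXY12345678"          # two bytes inserted mid-image
d = make_delta(old, new)
assert apply_delta(old, d) == new
print(d)  # [('COPY', 0, 8), ('INSERT', b'X'), ('INSERT', b'Y'), ('COPY', 8, 8)]
```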
- “Dependence-based Multi-level Tracing and Replay for Wireless Sensor Networks Debugging,” Man Wang, Zhiyuan Li, Feng Li, Xiaobing Feng, Saurabh Bagchi, and Yung-Hsiang Lu, At the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), 10 pages, April 12-14, 2011. (Acceptance rate: 17/48 = 35.4%) [ Abstract ]
Due to resource constraints and unreliable communication, wireless sensor network (WSN) programming and debugging remain challenging tasks. Deterministic replay is an error diagnosis method which has long been proposed for distributed systems. However, one of the significant hurdles to applying deterministic replay on WSNs is posed by the small program memory on typical sensor nodes. This paper proposes a dependence-based multi-level method for memory-efficient tracing and replay. In the interest of portability across different hardware platforms, the method is implemented as a source-level tracing and replaying tool. To further reduce the code size after tracing instrumentation, a cost model is used for making the decision on which functions to inline. A prototype of the tool, targeting C programs, is developed on top of the Open64 compiler and is tested using several TinyOS applications running on TelosB motes. Preliminary experimental results show that the test programs, which do not fit in the program memory after straightforward instrumentation, can be successfully instrumented using the new method such that the injected errors can be found.
|
|
2010 |
- "Secure neighbor discovery through overhearing in static multihop
wireless networks," Srikanth Hariharan, Ness B. Shroff, and Saurabh Bagchi, Elsevier Computer Networks, vol. 55, issue 6, pp. 1229-1241, April 2011.
[ Abstract ]
In wireless ad-hoc and sensor networks, neighbor discovery is one of the first steps performed by a node upon deployment, and disrupting it adversely affects a number of routing, MAC, topology discovery, and intrusion detection protocols. It is especially harmful when an adversary can convince nodes that it is a legitimate neighbor, which it can do easily and without the use of cryptographic primitives. In this paper, we develop a secure neighbor discovery protocol, SEDINE, for static multihop wireless networks. We prove that, in the absence of packet losses, without using any centralized trusted node or specialized hardware, SEDINE prevents any node, legitimate or malicious, from being incorrectly added to the neighbor list of another legitimate node that is not within its transmission range. We provide simulation results to demonstrate the efficacy of SEDINE in the presence of packet losses.
- "Stealthy Attacks in Wireless Ad Hoc Networks: Detection and Countermeasure,” Issa Khalil and Saurabh Bagchi, IEEE Transactions on Mobile Computing (TMC), vol. 10, issue 8, pp. 1096-1112, August 2011. [ Abstract ]
Stealthy packet dropping is a suite of four attacks --- misrouting, power control, identity delegation, and colluding collision --- that can be easily launched against multihop wireless ad hoc networks. Stealthy packet dropping disrupts the packet from reaching the destination through malicious behavior at an intermediate node. However, the malicious node gives the impression to its neighbors that it performs the legitimate forwarding action. Moreover, a legitimate node comes under suspicion. A popular method for detecting attacks in wireless networks is behavior-based detection performed by normal network nodes through overhearing the communication in their neighborhood. This leverages the open broadcast nature of wireless communication. An instantiation of this technique is local monitoring. We show that local monitoring, and the wider class of overhearing-based detection, cannot detect stealthy packet dropping attacks. Additionally, it mistakenly detects and isolates a legitimate node. We present a protocol called SADEC that can detect and isolate stealthy packet dropping attacks efficiently. SADEC presents two techniques that can be overlaid on basic local monitoring: having the neighbors maintain additional information about the routing path, and adding some checking responsibility to each neighbor. Additionally, SADEC provides an innovative mechanism to better utilize local monitoring by considerably increasing the number of nodes in a neighborhood that can do monitoring. We show through analysis and simulation experiments that basic local monitoring fails to efficiently mitigate most of the presented attacks while SADEC successfully mitigates them.
- "Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian": Amiya Kumar Maji, Kangli Hao, Salmin Sultana, and Saurabh Bagchi. At the 21st annual International Symposium on Software Reliability Engineering (ISSRE 2010), 10 pages, Nov 1-4, 2010, San Jose, California. (Acceptance rate: 40/130 = 30.8%) [ Abstract ]
As smart phones grow in popularity, manufacturers are in a race to pack an increasingly rich set of features into these tiny devices. This brings additional complexity in the system software, which has to fit within the constraints of the devices (chiefly memory, stable storage, and power consumption), and hence new bugs are revealed. How this evolution of smartphones impacts their reliability is a question that has been largely unexplored till now. With the release of open source OSes for hand-held devices, such as Android (open sourced in October 2008) and Symbian (open sourced in February 2010), we are now in a position to explore the above question. In this paper, we analyze the reported cases of failures of Android and Symbian based on bug reports posted by third-party developers and end users and documentation of bug fixes from Android developers. First, based on 628 developer reports, our study looks into the manifestation of failures in different modules of Android and their characteristics, such as their transience in time. Next, we analyze similar properties of Symbian bugs based on 153 failure reports. Our study indicates that Development tools, Web browsers, and Multimedia applications are most error-prone in both these systems. We further analyze 233 bug fixes for Android and categorize the different types of code modifications required for the fixes. The analysis shows that 78% of errors required minor code changes, with the largest share of these coming from modifications to attribute values and conditions. Our final analysis focuses on the relation between customizability, code complexity, and reliability in Android and Symbian. We find that despite high cyclomatic complexity, the bug densities in Android and Symbian are surprisingly low. However, the support for customizability does impact the reliability of mobile OSes and there are cautionary tales for their further development.
- “Fixed Cost Maintenance for Information Dissemination in Wireless Sensor Networks,” Rajesh Krishna Panta, Madalina Vintila, and Saurabh Bagchi, At the 29th IEEE Symposium on Reliable Distributed Systems (SRDS), pp. 54-63, October 31-November 3, 2010, New Delhi, India. (Acceptance rate: 21/93 = 22.6%) [ Presentation ] [ Abstract ]
Because of transient wireless link failures, incremental node deployment, and node mobility, existing information dissemination protocols used in wireless ad-hoc and sensor networks cause nodes to periodically broadcast "advertisements" containing the version of their current data item, even in the "steady state" when no dissemination is being done. This is to ensure that all nodes in the network are up-to-date. This causes a continuous energy expenditure during the steady state, which is by far the dominant part of a network’s lifetime. In this paper, we present a protocol called Varuna which incurs a constant energy cost, independent of the duration of the steady state. In Varuna, nodes monitor the traffic pattern of the neighboring nodes to decide when an advertisement is necessary. Using testbed experiments and simulations, we show that Varuna achieves several orders of magnitude energy savings compared to Trickle, the existing standard for dissemination in sensor networks, at the expense of a reasonable amount of memory for state maintenance.
- "RDAS: Reputation-Based Resilient Data Aggregation in Sensor Network": Carlos Perez-Toro, Rajesh Krishna Panta, and Saurabh Bagchi. In the 7th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 9 pages, June 21-25, 2010, Boston, Massachusetts. (Acceptance rate: 63/274 = 23.0%) [ Presentation ] [ Abstract ]
Data aggregation in wireless sensor networks is vulnerable to security attacks and natural failures. A few nodes can drastically alter the result of the aggregation by reporting erroneous data. In this paper we present RDAS, a robust data aggregation protocol that uses a reputation-based approach to identify and isolate malicious nodes in a sensor network. RDAS is based on a hierarchical clustering arrangement of nodes, where a cluster head analyzes data from the cluster nodes to determine the location of an event. It uses the redundancy of multiple nodes sensing an event to determine what data should have been reported by each node. Nodes form part of a distributed reputation system, where they share information about other nodes' performance in reporting accurate data and use the reputation ratings to suppress the reports from malicious nodes. RDAS is able to perform accurate data aggregation in the presence of individually malicious and colluding nodes, as well as nodes that try to compromise the integrity of the reputation system by lying about other nodes' behavior. We show that RDAS is more resilient to security attacks, with respect to accuracy of event localization, than the baseline data aggregation protocol with no security feature.
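A toy version of the reputation idea follows, with an invented update rule and parameters: the cluster head estimates the event value from the redundant reports, scores each node by agreement with that estimate, and discounts persistent liars.

```python
# Reputation-weighted aggregation at a cluster head, in the spirit of RDAS.
# The update rule, decay factor, and thresholds are illustrative only.
import statistics

reputation = {}   # node id -> score in [0, 1]

def aggregate(reports, alpha=0.2, tol=2.0, min_rep=0.3):
    """Fuse per-node readings, discounting nodes with a bad history."""
    consensus = statistics.median(reports.values())   # redundancy-based estimate
    for node, value in reports.items():
        agreed = abs(value - consensus) <= tol
        r = reputation.get(node, 0.5)
        reputation[node] = (1 - alpha) * r + alpha * (1.0 if agreed else 0.0)
    trusted = {n: v for n, v in reports.items() if reputation[n] >= min_rep}
    return (sum(v * reputation[n] for n, v in trusted.items())
            / sum(reputation[n] for n in trusted))

# Node "m" keeps lying; its influence decays over successive rounds until
# its reports are suppressed entirely.
for _ in range(6):
    est = aggregate({"a": 20.1, "b": 19.8, "c": 20.3, "m": 45.0})
print(round(est, 2), {n: round(r, 2) for n, r in reputation.items()})
```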
- "AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks": Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz. In the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 10 pages, June 28-July 1, 2010, Chicago, IL. (Acceptance rate (DCCS track): 40/174 = 23%) [ Presentation ] [ Abstract ]
Today’s largest systems have over 100,000 cores, with million-core systems expected over the next few years. This growing scale makes debugging the applications that run on them a daunting challenge. Few debugging tools perform well at this scale and most provide an overload of information about the entire job. Developers need tools that quickly direct them to the root cause of the problem. This paper presents AutomaDeD, a tool that identifies which tasks of a large-scale application first manifest a bug at a specific code region and specific program execution point. AutomaDeD statistically models the application’s control-flow and timing behavior, grouping tasks and identifying deviations from normal execution, which significantly reduces debugging effort. In addition to a case study in which AutomaDeD locates a bug that occurred during development of MVAPICH, we evaluate AutomaDeD on a range of bugs injected into the NAS parallel benchmarks. Our results demonstrate that AutomaDeD detects the time period when a bug first manifested with 90% accuracy for stalls and hangs and 70% accuracy for interference faults. It identifies the subset of processes first affected by the fault with 80% accuracy and 70% accuracy, respectively, and the code region where the fault first manifested with 90% and 50% accuracy, respectively.
- "Statistical Fault Detection for Parallel Applications with AutomaDeD," Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz, At the 6th IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), pp. 1-6, Stanford, CA, Mar 23-24, 2010. [ Abstract ]
Today’s largest systems have over 100,000 cores, with million-core systems expected over the next few years. The large component count means that these systems fail frequently and often in very complex ways, making them difficult to use and maintain. While prior work on fault detection and diagnosis has focused on faults that significantly reduce system functionality, the wide variety of failure modes in modern systems makes them likely to fail in complex ways that impair system performance but are difficult to detect and diagnose. This paper presents AutomaDeD, a statistical tool that models the timing behavior of each application task and tracks its behavior to identify any abnormalities. If any are observed, AutomaDeD can immediately detect them and report to the system administrator the task where the problem began. This identification of the fault’s initial manifestation can provide administrators with valuable insight into the fault’s root causes, making it significantly easier and cheaper for them to understand and repair it. Our experimental evaluation shows that AutomaDeD detects a wide range of faults immediately after they occur 80% of the time, with a low false-positive rate. Further, it identifies weaknesses of the current approach that motivate future research.
|
2009 |
" How To Keep Your Head Above Water While Detecting Errors": Ignacio Laguna, Fahad A. Arshad, David M. Grothe, and Saurabh Bagchi. In: ACM/IFIP/USENIX 10th International Middleware Conference, November 30-December 4, 2009, Urbana-Champaign, Illinois. (Acceptance rate: 21/110 = 19.1%) [ Presentation ] [ abstract ]
Today’s distributed systems need runtime error detection to catch errors arising from software bugs, hardware errors, or unexpected operating conditions. A prominent class of error detection techniques operates in a stateful manner, i.e., it keeps track of the state of the application being monitored and then matches state-based rules. Large-scale distributed applications generate a high volume of messages that can overwhelm the capacity of a stateful detection system. An existing approach to handle this is to randomly sample the messages and process a subset. However, this approach leads to non-determinism with respect to the detection system’s view of what state the application is in. This in turn leads to degradation in the quality of detection. We present an intelligent sampling algorithm and a Hidden Markov Model (HMM)-based algorithm to select the messages that the detection system processes and determine the application states such that the non-determinism is minimized. We also present a mechanism for selectively triggering computationally intensive rules based on a light-weight mechanism that determines if the rule is likely to be flagged. We demonstrate the techniques in a detection system called Monitor applied to a J2EE multi-tier application. We empirically evaluate the performance of Monitor under different load conditions and error scenarios and compare it to a previous system called Pinpoint.
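To illustrate the HMM-guided selection idea, the sketch below hand-rolls a forward update over a two-state application model and verifies a message only when the tracked state is uncertain; the states, probabilities, and entropy threshold are invented for the example.

```python
# Hand-rolled sketch of HMM-style state tracking used to decide which
# application messages a detection system should verify. The two-state
# model and its probabilities are made up for illustration.
import math

STATES = ("browsing", "checkout")
TRANS = {"browsing": {"browsing": 0.9, "checkout": 0.1},
         "checkout": {"browsing": 0.3, "checkout": 0.7}}
EMIT = {"browsing": {"GET": 0.8, "POST": 0.2},
        "checkout": {"GET": 0.3, "POST": 0.7}}

belief = {"browsing": 0.5, "checkout": 0.5}

def entropy(b):
    return -sum(p * math.log2(p) for p in b.values() if p > 0)

def observe(msg):
    """Forward update of the state belief after seeing one message type."""
    global belief
    new = {s: EMIT[s][msg] * sum(belief[t] * TRANS[t][s] for t in STATES)
           for s in STATES}
    z = sum(new.values())
    belief = {s: p / z for s, p in new.items()}

for msg in ["GET", "GET", "POST", "POST", "GET"]:
    observe(msg)
    # Verify (run detection rules on) a message only when we are unsure which
    # state the application is in; skip it when the belief is sharp.
    verify = entropy(belief) > 0.8
    print(msg, {s: round(p, 2) for s, p in belief.items()},
          "verify" if verify else "skip")
```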
" FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments": Tanzima Zerin, Saurabh Bagchi, and Rudolf Eigenmann. In: the ACM/IEEE Supercomputing Conference, November 14-20, 2009, Portland, Oregon. (Acceptance rate: 59/261 = 22.6%) (Nominated as one of 4 best student papers) [ Presentation ] [ abstract ]
In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as the performance degradation is tolerable. For guest users, free resources come at the cost of unpredictable “failures”, where failures are defined as disruptions in the guest job’s execution due to contention from the processes of the machine owner or the conventionally understood hardware and software failures. These unpredictable failures lead to unpredictable completion times. Checkpoint-recovery has long been used for providing reliability in failure-prone computing environments. Today’s production FGCS systems, such as Condor, use expensive, high-performance dedicated checkpoint servers, even though they could take advantage of free disk resources offered by the clusters’ commodity machines. Also, in large, geographically distributed clusters, dedicated checkpoint servers may incur high checkpoint transfer latencies. In this paper we consider using available, free disk resources as shared storage hosts for serving as checkpoint repositories. Doing so raises new challenges in providing fault tolerance, because a failing storage host may lead to a loss of saved application states. We model failures of such shared storage hosts, develop an algorithm for predicting these failures, and use the predictions to choose an appropriate set of storage hosts. We describe the development of our system, called Falcon, in the production university-wide Condor testbed at Purdue University, named “BoilerGrid”. Through experiments on BoilerGrid, we show that Falcon provides improved and consistent performance to guest jobs by ensuring reliability and robustness in the presence of irregular resource availability.
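A greedy caricature of the repository-selection tradeoff follows: each candidate host is scored by its checkpoint transfer cost plus the expected re-execution loss from its unavailability. The host list and constants are hypothetical, and Falcon's actual selection is driven by its failure-prediction model.

```python
# Greedy sketch of choosing checkpoint storage hosts from non-dedicated
# machines, trading predicted availability against transfer cost. Host data
# is invented; Falcon derives availability from failure prediction.

hosts = [  # (name, predicted availability, checkpoint transfer cost in s)
    ("cs-lab-07",  0.95, 12.0),
    ("me-lab-113", 0.80,  4.0),
    ("ee-lab-22",  0.60,  3.0),
    ("lib-kiosk",  0.30,  2.0),
]

def pick_repositories(hosts, k=2, lost_work=600.0):
    """Pick k hosts minimizing transfer cost plus expected recovery loss."""
    def expected_cost(h):
        _, avail, cost = h
        # If the host is down when needed, the guest job redoes lost work.
        return cost + (1.0 - avail) * lost_work
    return sorted(hosts, key=expected_cost)[:k]

for name, avail, cost in pick_repositories(hosts):
    print(name, avail, cost)   # picks the reliable hosts despite higher cost
```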
-
" A Tale of Two Synchronizing Clocks": Jinkyu Koo, Rajesh Krishna Panta, Saurabh Bagchi, and Luis Montestruque. In: 7th ACM Conference on Embedded Networked Sensor Systems (SenSys), November 4-6, 2009, Berkeley, California. (Acceptance rate: 21/119 = 17.6%) [ Presentation ] [ abstract ]
A specific application for wastewater monitoring and actuation, called SwimNet, deployed city-wide in a mid-sized US city, posed some challenges to a time synchronization protocol. The nodes in SwimNet have a low duty cycle (2% in the current deployment) and use an external clock, called the Real Time Clock (RTC), for triggering the sleep and the wake-up. The RTC has a very low drift (2 ppm) over the wide range of temperature fluctuations that the SwimNet nodes experience, while having a low power consumption (0.66 mW). However, these clocks still have to be synchronized occasionally during the long lifetime of the SwimNet nodes, and this was the problem we confronted with our time synchronization protocol. To fit within the power and cost constraints, the RTC makes the tradeoff of having a coarse time granularity of only 1 second. Therefore, it is not sufficient to synchronize the RTC itself—that would mean a synchronization error of up to 1 second would be possible even with a perfect synchronization protocol. This would be unacceptable for the low duty cycle operation—each node stays awake for only 6 seconds in a 5 minute time window. This was the first of three challenges for time synchronization. The second challenge is that the synchronization has to be extremely fast since ideally the entire network should be synchronized during the 6 second wake-up period. Third, the long range radio used for the metropolitan-scale SwimNet does not make its radio stack software available, as is seen with several other radios for long-range ISM band RF communication. Therefore, a common technique for time synchronization—MAC layer time-stamping—cannot be used. Additionally, MAC layer time-stamping is known to be problematic with high speed radios (even at 250 kbps). We solve these challenges and design a synchronization protocol called HARMONIA. It has three design innovations. First, it uses the finely granular microcontroller clock to achieve synchronization of the RTC, such that the synchronization error, despite the coarse granularity of the RTC, is in the microsecond range. Second, HARMONIA pipelines the synchronization messages through the network, resulting in fast synchronization of the entire network. Third, HARMONIA provides failure handling for transient node and link failures such that the network is not overburdened with synchronization messages and the recovery is done locally. We evaluate HARMONIA on SwimNet nodes and compare the two metrics of synchronization error and synchronization speed with FTSP. It performs slightly worse on the former and significantly better on the latter.
-
" Multigrade Security Monitoring for Ad-Hoc Wireless Networks": Matthew Tan Creti, Matthew Beaman, Saurabh Bagchi, Zhiyuan Li, and Yung-Hsiang Lu. In: 6th IEEE International Conference on Mobile Ad-hoc and Sensor Systems (MASS), October 12-15, 2009, Macau SAR, China. (Acceptance rate: 62/245 = 25.3%). [ Presentation ] [ abstract ]
Ad-hoc wireless networks are being deployed in critical applications that require protection against sophisticated adversaries. However, wireless routing protocols, such as the widely-used AODV, are often designed with the assumption that nodes are benign. Cryptographic extensions such as Secure AODV (SAODV) protect against some attacks but are still vulnerable to easily-performed attacks using colluding adversaries, such as the wormhole attack. In this paper, we make two contributions to securing routing protocols. First, we present a protocol called Route Verification (RV) that can detect and isolate malicious nodes involved in routing-based attacks with very high likelihood. However, RV is expensive in terms of energy consumption due to its radio communications. To remedy the high energy cost of RV, we make our second contribution: we propose a multigrade monitoring (MGM) approach. The MGM approach employs a previously developed lightweight local monitoring technique to detect any necessary condition for an attack to succeed. However, local monitoring suffers from false positives due to collisions on the wireless channel. When a necessary condition is detected, the heavy-weight RV protocol is triggered. We show through simulation that MGM applied to AODV generally requires little extra energy compared to baseline AODV, under the common case where there is no attack present. It is also more resource-efficient and powerful than SAODV in detecting attacks. Our work, for the first time, lays out the framework of multigrade monitoring, which we believe fundamentally addresses the tension between security and resource consumption in ad-hoc wireless networks.
-
" UnMask: Utilizing Neighbor Monitoring for Attack Mitigation in Multihop Wireless Sensor Networks": Issa Khalil, Saurabh Bagchi, Cristina N.-Rotaru, and Ness Shroff. In Elsevier Ad Hoc Networks Journal, notification of acceptance: June 2009. [ abstract ] Sensor networks enable a wide range of applications in both military and civilian domains. However, the deployment scenarios, the functionality requirements, and the limited capabilities of these networks expose them to a wide-range of attacks against control traffic (such as wormholes, rushing, Sybil attacks, etc) and data traffic (such as selective forwarding). In this paper we propose a framework called UNMASK that mitigates such attacks by detecting, diagnosing, and isolating the malicious nodes. UNMASK uses as a fundamental building block the ability of a node to oversee its neighboring nodes’ communication. On top of UNMASK, we build a secure routing protocol, LSR, that provides additional protection against malicious nodes by supporting multiple node-disjoint paths. We analyze the security guarantees of UNMASK and use ns-2 simulations to show its effectiveness against representative control and data attacks. The overhead analysis we present shows that UNMASK is a lightweight protocol appropriate for securing resource constrained sensor networks.
" Zephyr: Efficient Incremental Reprogramming of Sensor Nodes using Function Call Indirections and Difference Computation": Rajesh Krishna Panta, Saurabh Bagchi, and Samuel P. Midkiff. At the USENIX Annual Technical Conference (USENIX '09), June 14-19, 2009, pp. 411-424, San Diego, California. (Acceptance rate: 32/191 = 16.8%). [ Presentation ] [ abstract ] Wireless reprogramming of sensor nodes is an essential requirement for long-lived networks since the software functionality changes over time. The amount of information that needs to be wirelessly transmitted during reprogramming should be minimized since reprogramming time and energy depend chiefly on the amount of radio transmissions. In this paper, we present a multihop incremental reprogramming protocol called Zephyr that transfers the delta between the old and the new software and lets the sensor nodes rebuild the new software using the received delta and the old software. It reduces the delta size by using application-level modifications to mitigate the effects of function shifts. Then it compares the binary images at the byte-level with a novel method to create small delta, that is then sent over the wireless network to all the nodes. For a wide range of software change cases that we experimented with, we nd that Zephyr transfers 1.83 to 1987 times less traffic through the network than Deluge, the standard reprogramming protocol for TinyOS, and 1.14 to 49 times less than an existing incremental reprogramming protocol by Jeong and Culler.
" Efficient Wireless Reprogramming through Reduced Bandwidth Usage and Opportunistic Sleeping": Rajesh Krishna Panta, Saurabh Bagchi and Issa M. Khalil. In Elsevier Ad Hoc Networks Journal, Volume 7, Issue 1, pp. 42-62, January 2009. [ abstract ] Wireless reprogramming of a sensor network is useful for uploading new code or for changing the functionality of existing code. Reprogramming may be done multiple times during a node's lifetime and therefore a node has to remain receptive to future code updates. Existing reprogramming protocols, including Deluge, achieve this by bundling the reprogramming protocol and the application as one code image which is transferred through the network. The reprogramming protocol being complex, the overall size of the program image that needs to be transferred over the wireless medium increases, thereby increasing the time and energy required for reprogramming a network. We present a protocol called Stream that significantly reduces this bloat by using the facility of having multiple code images on the node. It pre-installs the reprogramming protocol as one image and equips the application program with the ability to listen for new code updates and switch to this image. For a sample application, the increase in size of the application image is 1 page (48 packets of 36 bytes each) for Stream and 11 pages for Deluge. Additionally, we design an opportunistic sleeping scheme whereby nodes can sleep during the period when reprogramming has been initiated but has not yet reached the neighborhood of the node. The savings become significant for large networks and for frequent reprogramming. We implement Stream on Mica2 motes and conduct testbed and simulation experiments to compare delay and energy consumption for different network sizes with respect to the state-of-the-art Deluge protocol.
-
" Optimal Monitoring in Multi-Channel Multi-Radio Wireless Mesh Networks": Dong-Hoon Shin and Saurabh Bagchi. At the 10h ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc 2009), May 18-21, 2009, pp. 229-238, New Orleans, LA. (Acceptance rate: 31/175 = 17.7%)
[ abstract ]
Wireless mesh networks (WMNs) are finding increasing usage in city-wide deployments for providing network connectivity. Mesh routers in WMNs typically use multiple wireless channels to enhance the spatial reuse of frequency bands, often with multiple radios per node. Due to the cooperative nature of WMNs, they are susceptible to many attacks that cannot be defeated by using traditional cryptographic mechanisms of authentication or encryption alone. A solution approach commonly used for defending against such attacks is behavior-based detection, in which some nodes overhear communication in their neighborhood to determine if the behavior by a neighbor is legitimate. It has been proposed to use specialized monitoring nodes deployed strategically throughout the network for performing such detection. The problem that arises is where to deploy these monitoring nodes, how to minimize their number, and which channels to tune their radios to, such that the maximum part of the network can be covered. This problem has been solved for single-channel networks by a greedy approximation algorithm, since the exact solution is NP-hard; provided that P != NP, the greedy algorithm achieves the best worst-case performance possible among all polynomial-time algorithms. In this paper, we solve the problem for multi-channel multi-radio WMNs. The intuitive extension of the greedy algorithm loses this best-performance property. Instead, we formulate the problem as an integer linear program, solve its linear program relaxation, and then use two rounding techniques that we develop by adapting existing rounding schemes. We thereby present two approximation algorithms. The first, a computationally light algorithm called the probabilistic rounding algorithm, gives the expected best worst-case performance. The second, called the deterministic rounding algorithm, achieves the best worst-case performance in a deterministic manner. To evaluate how the three algorithms perform in practice, we simulate them in random networks and scale-free networks.
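For a flavor of the LP-relaxation-and-rounding recipe, the sketch below solves plain set cover (a simplification of monitor placement) with scipy's linprog and then applies standard randomized rounding; the instance is made up, and the paper's multi-channel multi-radio formulation is considerably richer.

```python
# LP relaxation plus probabilistic rounding for a toy monitor-placement
# instance, phrased as plain set cover: each candidate monitor location
# covers some set of nodes. The ln(n) inflation follows the textbook
# randomized-rounding recipe, not the paper's exact scheme.
import math
import numpy as np
from scipy.optimize import linprog

covers = [  # candidate monitor j -> set of node ids it can overhear
    {0, 1, 2}, {2, 3}, {3, 4, 5}, {0, 5}, {1, 4},
]
n_nodes, n_cand = 6, len(covers)

# LP relaxation of set cover: min sum x_j  s.t.  sum_{j covers i} x_j >= 1.
A = np.zeros((n_nodes, n_cand))
for j, cov in enumerate(covers):
    for i in cov:
        A[i, j] = 1.0
res = linprog(c=np.ones(n_cand), A_ub=-A, b_ub=-np.ones(n_nodes),
              bounds=[(0, 1)] * n_cand, method="highs")

# Probabilistic rounding: take monitor j with probability min(1, ln(n) * x_j).
# In practice one repeats the rounding (or greedily repairs) until all nodes
# are covered, as the standard analysis prescribes.
rng = np.random.default_rng(1)
boost = math.log(n_nodes)
chosen = [j for j, x in enumerate(res.x) if rng.random() < min(1.0, boost * x)]
covered = set().union(*(covers[j] for j in chosen)) if chosen else set()
print("fractional optimum:", round(res.fun, 2),
      "chosen:", chosen, "covered:", sorted(covered))
```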
" Spam Detection in Voice-over-IP Calls through Semi-Supervised Clustering": Yu-Sung Wu, Saurabh Bagchi, Navjot Singh, and Ratsameetip Wita. In: 39th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Lisbon, Portugal, pp. 307-316, June 29-July 2, 2009. (Acceptance rate: 63/260 = 24.2%) [ Presentation ] [ abstract ]
In this paper, we present an approach for detection of spam calls over IP telephony, called SPIT, in VoIP systems. SPIT detection differs from spam detection in email in that the process has to be soft real-time, fewer features are available for examination due to the difficulty of mining voice traffic at runtime, and the signaling traffic of legitimate and malicious callers is similar. Our approach differs from existing work in its adaptability to new environments without the need for laborious and error-prone manual parameter configuration. We use clustering based on the call parameters, using optional user feedback for some calls, which users mark as SPIT or non-SPIT. We improve on a popular algorithm for semi-supervised learning, called MPCK-Means, to make it scalable to a large number of calls and able to operate at runtime. Our evaluation on captured call traces shows a fifteenfold reduction in computation time, with improvement in detection accuracy.
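The sketch below shows only the seeding side of semi-supervised clustering: a few user-labeled calls initialize the centroids of a plain k-means. MPCK-Means additionally enforces pairwise constraints and learns a distance metric, which this toy omits; the call features are invented.

```python
# Seeded k-means sketch: a few user-labeled calls initialize the SPIT and
# legitimate centroids, then unlabeled calls are clustered as usual.
# Features (calls per hour, average duration in seconds) are made up.
import numpy as np

def seeded_kmeans(X, seeds, iters=10):
    """seeds: {label: indices of X known to carry that label}."""
    cents = np.array([X[idx].mean(axis=0) for idx in seeds.values()])
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - cents[None]) ** 2).sum(-1), axis=1)
        for k, idx in enumerate(seeds.values()):   # labeled points stay put
            labels[idx] = k
        cents = np.array([X[labels == k].mean(axis=0)
                          for k in range(len(cents))])
    return labels

X = np.array([[50, 8], [45, 10], [60, 6],      # high call rate, short calls
              [2, 180], [3, 240], [1, 300]],   # low rate, long conversations
             dtype=float)
print(seeded_kmeans(X, {"spit": [0], "ok": [3]}))
# -> [0 0 0 1 1 1]: the first three calls cluster with the SPIT seed
```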
-
" Covert TCP/IP Timing Channels: Theory to Implementation": Sarah Sellke, Chih-Chun Wang, Saurabh Bagchi, Ness Shroff. 28th Annual IEEE Conference on Computer Communications (INFOCOM), pp. 2204-2212, April 19-25 2009, Rio de Janeiro, Brazil. (Acceptance rate: 282/1435 = 19.7%). [ Presentation ] [ abstract ]
There has been significant recent interest in covert communication using timing channels. In network timing channels, information is leaked by controlling the time between transmissions of consecutive packets. Our work focuses on network timing channels and provides two main contributions. The first is to quantify the threat posed by covert network timing channels. The other is to use timing channels to communicate at a low data rate without being detected. In this paper, we design and implement a covert TCP/IP timing channel. We are able to quantify the achievable data rate (or leak rate) of such a covert channel. Moreover, we show that by sacrificing data rate, the traffic patterns of the covert timing channel can be made computationally indistinguishable from that of normal traffic, which makes detecting such communication virtually impossible. We demonstrate the efficacy of our solution by showing significant performance gains in terms of both data rate and covertness over the state-of-the-art.
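A toy inter-packet-delay channel conveys the basic mechanism; the two base delays and the jitter bound below are arbitrary, and the paper's contribution is precisely in shaping such delays to be indistinguishable from normal traffic, which this sketch does not attempt.

```python
# Toy inter-packet-delay timing channel: bit 0 -> short gap, bit 1 -> long gap.
# The base delays and jitter are illustrative values, not the paper's design.
import random

D0, D1 = 0.05, 0.15          # seconds between packets for bit 0 / bit 1

def encode(bits, jitter=0.02):
    """Delays the sender inserts between consecutive packets."""
    return [(D1 if b else D0) + random.uniform(-jitter, jitter) for b in bits]

def decode(gaps):
    """Receiver thresholds the observed inter-arrival times."""
    thresh = (D0 + D1) / 2
    return [1 if g > thresh else 0 for g in gaps]

msg = [1, 0, 1, 1, 0, 0, 1]
assert decode(encode(msg)) == msg    # holds while jitter < (D1 - D0) / 2
print("recovered:", decode(encode(msg)))
```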
" Intrusion Detection in Voice-over-IP Environments": Yu-Sung Wu, Vinita Apte, Saurabh Bagchi, Sachin Garg, Navjot Singh. Elsevier International Journal of Information Security (IJIS). [ abstract ] In this article, we present the design of an intrusion detection system for VoIP networks. The first part of our work consists of a simple single-component intrusion detection system called SCIDIVE. In the second part, we extend the design of SCIDIVE and build a distributed and correlation-based intrusion detection system called SPACEDIVE. We create several attack scenarios and evaluate the accuracy and efficiency of the system in the face of these attacks. To the best of our knowledge, this is the first comprehensive look at the problem of intrusion detection in VoIP systems. It includes treatment of the challenges faced due to the distributed nature of the system, the nature of the VoIP traffic, and the specific kinds of attacks at such systems.
-
" Hermes: Fast and Energy Efficient Incremental Code Updates for Wireless Sensor Networks": Rajesh Krishna Panta, Saurabh Bagchi. In: 28th Annual IEEE Conference on Computer Communications (INFOCOM), pp. 639-647,April 19-25 2009, Rio de Janeiro, Brazil. (Acceptance rate: 282/1435 = 19.7%) [ Presentation ] [ abstract ]
Wireless reprogramming of sensor nodes is a requirement for long-lived networks due to changes in the functionality of the software running on the nodes. The amount of information that needs to be wirelessly transmitted during reprogramming should be minimized to reduce reprogramming time and energy. In this paper, we present a multi-hop incremental reprogramming protocol called Hermes that transfers the delta between the old and new software and lets the sensor nodes rebuild the new software using the received delta and the old software. It reduces the delta by using techniques to mitigate the effects of function and global variable shifts caused by the software modifications. Then it compares the binary images at the byte level with a method to create a small delta that needs to be sent over the wireless network to all the nodes. For a wide range of software change scenarios that we experimented with, we find that Hermes transfers up to 201 times less information than Deluge, the standard reprogramming protocol for TinyOS, and 64 times less than an existing incremental reprogramming protocol by Jeong and Culler.
|
2008 |
- "Effects of Types of Active Learning Activity on Two Junior-Level Computer
Engineering Courses": Saurabh Bagchi, Mark C. Johnson, and Somali Chaterji. In: 38th
Annual Frontiers in Education (FIE) Conference, 6 pages, Saratoga Springs, New
York, October 25-28, 2008.
- "Impact of Research Technologies on Service Learning": Saurabh Bagchi, Carla B. Zoltowski, and William C. Oakes. In: 38th Annual Frontiers in Education (FIE) Conference, 2
pages, Saratoga
Springs, New York, October 25-28, 2008.
-
" Search for Efficiency in Automated Intrusion Response for Distributed Applications": Yu-Sung Wu, Gaspar Modelo-Howard, Bingrui Foo, Saurabh Bagchi, Eugene Spafford. In: 27th
International Symposium on Reliable Distributed Systems (SRDS), pp. 53-62, Naples,
Italy, October 6-8, 2008. (Acceptance rate: 28/112 = 25%) [ presentation ] [ abstract ]
Providing automated responses to security incidents in a distributed computing environment has been an important area of research. This is due to the inherent complexity of such systems, which makes it difficult to eliminate all vulnerabilities before deployment and costly to rely on humans for responding to incidents in real time. Earlier work has shed light on automated responses, picking the best local response that stops an attack's propagation from its current step to the next step. Here we propose a new approach where the optimality of responses is considered from a global point of view: "What is the eventual outcome on the whole system from using a response?". We formalize the process of providing automated responses and the criterion for asserting global optimality of the set of deployed responses. We show that reaching the globally optimal solution is an NP-hard problem. Therefore, we design a genetic algorithm framework for searching for good solutions. In the real world, good solutions can change as the problem structure changes. Here the problem structure involves the protected target system and the attacks, both of which can change over time. Our framework constantly adapts itself to the changing environment based on short-term history and also tracks the patterns of attacks in a long-term history. We demonstrate the solution on a distributed e-commerce application called Pet Store with injection of real attacks and show that it improves the survivability of the system over previous works.
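A bare-bones genetic algorithm over response sets makes the search idea concrete; the fitness function (benefit of deployed responses minus their cost) and all constants are invented here, not the paper's formalization.

```python
# Minimal genetic algorithm searching over sets of responses, echoing the
# paper's global-optimization view. The toy fitness below is invented.
import random

N_RESP = 8
COST    = [random.uniform(1, 5) for _ in range(N_RESP)]
BENEFIT = [random.uniform(0, 10) for _ in range(N_RESP)]

def fitness(genome):            # genome[i] == 1 -> deploy response i
    return (sum(b * g for b, g in zip(BENEFIT, genome))
            - sum(c * g for c, g in zip(COST, genome)))

def evolve(pop_size=30, gens=50, mut=0.05):
    pop = [[random.randint(0, 1) for _ in range(N_RESP)]
           for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_RESP)                 # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < mut) for g in child]  # mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(best, round(fitness(best), 2))
```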
-
" SeNDORComm: An Energy-Efficient Priority-Driven Communication Layer for
Reliable Wireless Sensor Networks": Vinai Sundaram, Saurabh Bagchi, Yung-Hsiang Lu, and Zhiyuan Li. In: 27th
International Symposium on Reliable Distributed Systems (SRDS), pp. 23-32, Naples,
Italy, October 6-8, 2008. (Acceptance rate: 28/112 = 25%) [ presentation ]
[ abstract ]
In many reliable Wireless Sensor Network (WSN) applications, messages have different priorities depending on urgency or importance. For example, a message reporting the failure of all nodes in a region is more important than that for a single node. Moreover, traffic can be bursty in nature, such as when a correlated error is reported by multiple nodes running identical code. Current communication layers in WSNs lack efficient support for these two requirements. We present a priority-driven communication layer, called SeNDORComm, which schedules transmission of packets driven by application-specified priority, buffers and packs multiple messages in a packet, and honors the latency guarantee for a message. We show that SeNDORComm improves energy efficiency, message reliability, and network utilization and delays congestion in a network. We extensively evaluate SeNDORComm using analysis, simulation, and testbed experiments. We demonstrate the improvement in goodput of SeNDORComm over the default communication layer, GenericComm in TinyOS (134.78% for a network of 20 nodes).
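A small sketch of priority-driven packing with per-message latency bounds follows, assuming a hypothetical 100-byte payload: messages about to violate their bound ride in the next packet regardless of priority.

```python
# Sketch of priority-driven message packing with latency bounds, in the
# spirit of SeNDORComm. Sizes, bounds, and the 1-second urgency window are
# illustrative; the real layer also handles buffering and congestion.
PAYLOAD = 100   # bytes of application payload per radio packet (assumption)

def build_packet(queue, now):
    """Pick messages for the next packet: expiring ones first, then priority."""
    queue.sort(key=lambda m: (m["deadline"] - now > 1.0, -m["priority"]))
    packet, used = [], 0
    for m in list(queue):
        if used + m["size"] <= PAYLOAD:
            packet.append(m)
            used += m["size"]
            queue.remove(m)
    return packet

q = [
    {"id": "node-fail",   "priority": 9, "size": 40, "deadline": 12.0},
    {"id": "temp-sample", "priority": 2, "size": 30, "deadline": 99.0},
    {"id": "batt-low",    "priority": 5, "size": 30, "deadline": 10.5},
]
print([m["id"] for m in build_packet(q, now=10.0)])
# batt-low is about to violate its latency bound, so it rides along with the
# high-priority failure report; the routine temperature sample fits as well.
```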
-
" Determining Placement of Intrusion Detectors for a Distributed Application through Bayesian Network Modeling": Gaspar Modelo-Howard, Saurabh Bagchi, Guy Lebanon. In: 11th International Symposium on Recent Advances in Intrusion
Detection (RAID), pp. 271-290, Boston, MA, September 15-17, 2008. (Acceptance rate:
20/80 = 25%) [ presentation ]
[ abstract ]
To secure today’s computer systems, it is critical to have different intrusion detection sensors embedded in them. The complexity of distributed computer systems makes it difficult to determine the appropriate configuration of these detectors, i.e., their choice and placement. In this paper, we describe a method to evaluate the effect of the detector configuration on the accuracy and precision of determining security goals in the system. For this, we develop a Bayesian network model for the distributed system, from an attack graph representation of multi-stage attacks in the system. We use Bayesian inference to solve the problem of determining the likelihood that an attack goal has been achieved, given a certain set of detector alerts. We quantify the overall detection performance in the system for different detector settings, namely, choice and placement of the detectors, their quality, and levels of uncertainty of adversarial behavior. These observations lead us to a greedy algorithm for determining the optimal detector settings in a large-scale distributed system. We present the results of experiments on Bayesian networks representing two real distributed systems and real attacks on them.
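To see how a detector alert shifts belief about an attack goal, here is inference by enumeration on a tiny invented three-stage chain with one detector; the paper's Bayesian networks are derived from real attack graphs and support many detectors.

```python
# Enumeration-based inference on a toy attack chain (web -> app -> db),
# showing how an app-tier alert updates belief that the goal was reached.
# All probabilities are invented for this sketch.
from itertools import product

P_WEB = 0.1                                    # web tier gets compromised
P_APP_GIVEN_WEB   = {True: 0.6, False: 0.01}
P_DB_GIVEN_APP    = {True: 0.5, False: 0.005}
P_ALERT_GIVEN_APP = {True: 0.8, False: 0.1}    # detector at the app tier

def p_goal_given_alert(alert_seen):
    num = den = 0.0
    for web, app, db in product([True, False], repeat=3):
        p = P_WEB if web else 1 - P_WEB
        p *= P_APP_GIVEN_WEB[web] if app else 1 - P_APP_GIVEN_WEB[web]
        p *= P_DB_GIVEN_APP[app] if db else 1 - P_DB_GIVEN_APP[app]
        p *= P_ALERT_GIVEN_APP[app] if alert_seen else 1 - P_ALERT_GIVEN_APP[app]
        den += p
        if db:
            num += p
    return num / den

print(round(p_goal_given_alert(False), 4))  # belief when no alert fires, ~0.013
print(round(p_goal_given_alert(True), 4))   # belief after the alert, ~0.19
```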
-
" MISPAR: Mitigating Stealthy Packet Dropping in Locally-Monitored Multi-hop Wireless Ad Hoc
Networks": Issa Khalil and Saurabh Bagchi. In: 4th International Conference on Security and Privacy in Communication Networks (SecureComm), 10 pages, Istanbul, Turkey, September 22-25, 2008. (Acceptance rate: 26/124 = 21%) [ abstract ]
Local monitoring has been demonstrated as a powerful technique for mitigating security attacks in multi-hop ad-hoc networks. In local monitoring, nodes overhear partial neighborhood communication to detect misbehavior such as packet drop or delay. However, local monitoring as presented in the literature is vulnerable to a class of attacks that we introduce here called stealthy packet dropping. Stealthy packet dropping disrupts the packet from reaching the destination by malicious behavior at an intermediate node. However, the malicious node gives the impression to its neighbors that it performed the legitimate forwarding action. Moreover, a legitimate node comes under suspicion. We introduce four ways of achieving stealthy packet dropping, none of which is currently detectable. We provide a protocol called MISPAR based on local monitoring to remedy each attack. It presents two techniques – having the neighbors maintain additional information about the routing path, and adding some checking responsibility to each neighbor. We show through analysis and simulation that the basic local monitoring fails to mitigate any of the presented attacks while MISPAR successfully mitigates them.
-
" Single versus Multi-hop Wireless Reprogramming in Sensor Networks": Rajesh Krishna Panta, Issa Khalil, Saurabh Bagchi,
Luis Montestruque. In: 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (Tridentcom), 7 pages, Innsbruck, Austria, March 18-20, 2008. [ Presentation ] [ abstract ]
Wireless reprogramming of the sensor network is useful for uploading new code or for changing the functionality of the existing code. In recent years, the research focus has shifted from single-hop reprogramming to multi-hop reprogramming, primarily because of its ease of use. Practical experience from a multi-hop sensor network for monitoring water pollution, called CSOnet, deployed in South Bend, IN, indicates that single-hop reprogramming may be preferable under certain conditions to minimize reprogramming time and energy. In single-hop reprogramming, the user gets close to the node to be reprogrammed and wirelessly reprograms a single node at a time. The choice between single-hop and multi-hop reprogramming depends on factors like network size, node density and, most importantly, link reliabilities. We present a protocol called DStream having both single and multi-hop reprogramming capabilities. We provide mathematical analysis and results from testbed experiments (including experiments conducted on CSOnet networks) and simulations to give insights into the choice of the two reprogramming methods for various network parameters.
-
" Optimizing AES for Embedded Devices and Wireless Sensor Networks": Shammi Didla, Aaron Ault and Saurabh Bagchi. In: 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities (Tridentcom), pp. 1-10, Innsbruck, Austria, March 18-20, 2008. [ Presentation ] [ abstract ]
The increased need for security in embedded applications in recent years has prompted efforts to develop encryption/decryption algorithms capable of running on resource-constrained systems. The inclusion of the Advanced Encryption Standard (AES) in the IEEE 802.15.4 Zigbee protocol has driven its widespread use in current embedded platforms. We propose an implementation of AES in a high-level language (C in this case) that is the first software-based solution for 16-bit microcontrollers capable of matching the communication rate of 250 kbps specified by the Zigbee protocol, while also minimizing RAM and ROM usage. We discuss a series of optimizations and their effects that lead to our final implementation achieving an encryption speed of 286 kbps, RAM usage of 260 bytes, and code size of 5160 bytes on the Texas Instruments MSP430 microprocessor. We also develop rigorous benchmark experiments to compare other AES implementations on a common platform, and show that our implementation outperforms the best available implementation by 85%.
|
2007 |
-
" Stateful Detection in High Throughput Distributed Systems": Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 275-287, Beijing, CHINA, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
With the increasing speed of computers and the complexity of applications, many of today’s distributed systems exchange data at a high rate. Significant work has been done on error detection achieved through external fault tolerance systems. However, the high data rate coupled with complex detection can cause the capacity of the fault tolerance system to be exhausted, resulting in low detection accuracy. We present a new stateful detection mechanism which observes the exchanged application messages, deduces the application state, and matches against anomaly-based rules. We extend our previous framework (the Monitor) to incorporate a sampling approach which adjusts the rate of verified messages. The sampling approach avoids the previously reported breakdown in the Monitor capacity at high application message rates, reduces the overall detection cost, and allows the Monitor to provide accurate detection. We apply the approach to a reliable multicast protocol (TRAM) and demonstrate its performance by comparing it with our previous framework.
-
" Distributed Diagnosis of Failures in a Three Tier E-Commerce System": Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 185-198, Beijing, CHINA, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
For dependability outages in distributed internet infrastructures, it is often not enough to detect a failure; it is also required to diagnose it, i.e., to identify its source. Complex applications deployed in multi-tier environments make diagnosis challenging because of fast error propagation, black-box applications, high diagnosis delay, the amount of state that can be maintained, and imperfect diagnostic tests. Here, we propose a probabilistic diagnosis model for arbitrary failures in components of a distributed application. The monitoring system (the Monitor) passively observes the message exchanges between the components and, at runtime, performs a probabilistic diagnosis of the component that was the root cause of a failure. We demonstrate the approach by applying it to the Pet Store J2EE application, and we compare it with Pinpoint by quantifying latency and accuracy in both systems. The Monitor outperforms Pinpoint by achieving comparably accurate diagnosis with higher precision in shorter time.
-
" Energy-efficient, On-demand Reprogramming of Large-Scale Sensor Networks": Mark D. Krasniewski, Rajesh K. Panta, Saurabh Bagchi, Chin-Lung Yang, and William J. Chappell. In: ACM Transactions on Sensor Networks (TOSN), notification of acceptance: June 2007. [ abstract ]
As sensor networks operate over long periods of deployment in difficult-to-reach places, their requirements may change or new code may need to be uploaded to them. The current state-of-the-art protocols (Deluge and MNP) for network reprogramming perform the code dissemination in a multi-hop manner using a three-way handshake whereby meta-data is exchanged prior to code exchange to suppress redundant transmissions. The code image is also pipelined through the network at the granularity of pages. In this paper we propose a protocol called Freshet for optimizing the energy for code upload and speeding up the dissemination if multiple sources of code are available. The energy optimization is achieved by equipping each node with limited non-local topology information, which it uses to determine the time when it can go to sleep since code is not being distributed in its vicinity. The protocol to handle multiple sources provides a loose coupling of nodes to a source and disseminates code in waves, each originating at a source, with a mechanism to handle collisions when the waves meet. The protocol's performance with respect to reliability, delay, and energy consumed is demonstrated through analysis, simulation, and implementation on the Berkeley mote platform.
-
" Capacity Bounds on Timing Channels with Bounded Service Times": Sarah H. Sellke, Chih-Chun Wang, Ness Shroff, and Saurabh Bagchi. In: IEEE International Symposium on Information Theory, pp. 981-985, Nice, France, June 24-29, 2007. [ abstract ]
It is well known that queues with exponentially distributed service times have the smallest Shannon capacity among all single-server queues with the same service rate. In this paper, we study the capacity of timing channels in which the service time distributions have bounded support, i.e., Bounded Service Timing Channels (BSTC). We derive an upper bound and two lower bounds on the capacity of such timing channels. The tightness of these bounds is investigated analytically as well as via simulations. We find that the uniform BSTC serves a role for BSTCs that is similar to what the exponential service timing channel does for the case of timing channels with unbounded service time distributions. That is, when the length of the support interval is small, the uniform BSTC has the smallest capacity among all BSTCs.
-
" Failure-Aware Checkpointing in Fine-Grained Cycle Sharing Systems": Xiaojuan Ren, Rudolf Eigenmann, and Saurabh Bagchi. In: 16th IEEE International Symposium on High Performance Distributed Computing (HPDC-16), Monterey Bay, California, June 27-29, 2007. (Acceptance rate: 20%). [ Presentation ] [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of idle computational resources available on the Internet. Such systems allow guest jobs to run on a host if they do not significantly impact the local users of the host. Since the hosts are typically provided voluntarily, their availability fluctuates greatly. To provide fault tolerance to guest jobs without adding significant computational overhead, we propose failure-aware checkpointing techniques that apply the knowledge of resource availability to select checkpoint repositories and to determine checkpoint intervals. We present schemes for selecting
reliable and efficient repositories from the non-dedicated hosts that contribute their disk storage. These schemes are formulated as 0/1 programming problems to optimize the network overhead of transferring checkpoints and the work lost due to unavailability of a storage host when it is needed to recover a guest job. We determine the checkpoint interval by comparing the cost of checkpointing immediately and the cost of delaying that to a later time, which is a function of the resource availability. We evaluate the failure-aware techniques on an FGCS system called iShare, using trace-based simulation. The results show that our techniques achieve better application performance than prevalent methods, which use checkpointing with a fixed periodicity on dedicated checkpoint servers.
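The interval decision amounts to a cost comparison that can be sketched in a few lines (the cost model below is an assumption for illustration, not the paper's exact formulation):

    # Checkpoint now if the expected work lost by delaying exceeds the
    # checkpoint overhead, given a predicted probability of host unavailability.
    def should_checkpoint(work_since_last, ckpt_cost, p_unavail):
        # work_since_last: seconds of computation since the last checkpoint
        # ckpt_cost: seconds to take and transfer a checkpoint
        # p_unavail: predicted probability the host becomes unavailable
        #            before the next decision point
        return p_unavail * work_since_last > ckpt_cost

    print(should_checkpoint(600, 30, 0.10))  # True: 60 s expected loss > 30 s cost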
-
" SLAM: Sleep-Wake Aware Local Monitoring in Sensor Networks", Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. IEEE Symposium on Dependable Systems and Networks (DSN), June 25-28, 2007, pp. 565-574, Edinburgh, Ireland. [ Presentation ] [ abstract ]
Sleep-wake protocols are critical in sensor networks to ensure long-lived operation. However, an open problem is how to develop efficient mechanisms that can be incorporated with sleep-wake protocols to ensure both long-lived operation and a high degree of security. Our contribution in this paper is to address this problem by using
local monitoring, a powerful technique for detecting and mitigating control and data attacks in sensor networks. In local monitoring, each node oversees part of the traffic going in and out of its neighbors to determine if the behavior is suspicious, such as an unusually long delay in forwarding a packet. Here, we present a protocol called SLAM to make local monitoring parsimonious in its energy consumption and to integrate it with any extant sleep-wake protocol in the network. The challenge is to enable sleep-wake in a secure manner even in the face of nodes that may be adversarial and may not wake up the nodes responsible for monitoring their traffic. We prove analytically that the security coverage is not weakened by the protocol. We perform simulations in ns-2 to demonstrate that the performance of local monitoring is practically unchanged while listening-energy savings of 30 to 129 times are achieved, depending on the network load.
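The per-packet bookkeeping behind local monitoring can be sketched as a simple watchdog (a hypothetical, heavily simplified rendering; the class and threshold below are illustrative, not taken from the SLAM paper):

    import time

    class LocalMonitor:
        # A guard node records packets it overhears going *into* a neighbor and
        # raises an alarm for any packet the neighbor fails to forward in time.
        def __init__(self, forward_deadline=0.5):
            self.pending = {}            # packet_id -> time first overheard
            self.deadline = forward_deadline

        def overheard_incoming(self, packet_id):
            self.pending.setdefault(packet_id, time.time())

        def overheard_forwarded(self, packet_id):
            self.pending.pop(packet_id, None)   # neighbor behaved correctly

        def suspicious_packets(self):
            now = time.time()
            return [p for p, t in self.pending.items() if now - t > self.deadline]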
" Fault Tolerant ARIMA-based Aggregation of Data in Sensor Networks": Doug Herbert, Gaspar Modelo-Howard, Carlos Perez-Toro, Saurabh Bagchi. Fast Abstract in the Supplemental Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2007, Edinburgh, Ireland. [ introduction ]
Sensor networks collect data using robust, energy-efficient, low-bandwidth protocols because of the limited resources available to them. Among the many methods explored to
improve the efficiency of sensor networks for data collection, researchers have proposed statistical methods to predict when each sensor should send data. Two shortcomings of this approach are: (1) the use of a static model, assuming unchanging environmental parameters for the sensor network, and (2) the lack of an error detection mechanism to compensate for failures during the transmission of data. Suppressing data transmissions by exploiting the fact that data follows a statistical model has been proposed before, with the assumption that the environmental parameters do not change. This assumption does not fit several scenarios where sensor networks can be deployed. Additionally, recent advancements in sensor networks allow for more powerful hardware (CPU and memory) on sensing nodes. This represents an opportunity to extend the role of end nodes by allowing them to perform statistical analysis on the collected data at runtime and to update model parameters in real time. In this manner, we can create a self-adapting model.
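The suppression-and-adaptation loop described above can be sketched with a simple AR(1) stand-in for the ARIMA model (the parameter names and the update rule are illustrative assumptions):

    class SuppressingSensor:
        # Transmit only when a reading deviates from the model's prediction,
        # and update the model parameter online so the model self-adapts.
        def __init__(self, phi=0.9, threshold=0.5, lr=0.01):
            self.phi, self.threshold, self.lr = phi, threshold, lr
            self.last = None

        def observe(self, x):
            if self.last is None:
                self.last = x
                return True                        # always send the first reading
            pred = self.phi * self.last
            err = x - pred
            self.phi += self.lr * err * self.last  # online parameter update
            self.last = x
            return abs(err) > self.threshold       # send only on model violation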
- Gunjan Khanna, “Non Intrusive Detection and Diagnosis of Failures in High Throughput Distributed Systems”. Defended: June 11, 2007. [ Thesis abstract ], [ Thesis ] and [ Final presentation slides ].
-
" Intrusion Response Systems: A Survey": Bingrui Foo, Matthew W. Glause, Gaspar Modelo-Howard, Yu-Sung Wu, Saurabh Bagchi, and Eugene Spafford. Book chapter in "Information Assurance: Dependability and Security in Networked Systems", pp. 377-416, Morgan Kaufmann Publishers. Publication date: Fall 2007. [ abstract ]
Protecting networks from attacks is an important concern of computer security. Intrusion prevention and intrusion detection systems have been the subject of much study and have been covered in several excellent survey papers. However, the actions that need to follow prevention and detection, namely response, have received less attention from researchers and practitioners. Response was traditionally thought of as an offline process with humans in the loop, such as system administrators performing forensics by going through system logs and determining which services or components need to be recovered. Our systems today have reached a level of complexity, and the attacks directed at them a level of sophistication, such that manual responses are no longer adequate. So far there has been limited work on autonomous intrusion response systems, especially work that provides rigorous analysis or generalizable system-building techniques, and the work that exists has not been surveyed previously. In this survey paper, we lay out the design challenges in building autonomous intrusion response systems. Then we classify existing work on the topic into four categories: response through static decision tables, response through dynamic decision processes, intrusion tolerance through diverse replicas, and intrusion response for specific classes of attacks. We describe the existing intrusion response systems after classifying them, present methods for benchmarking them, and end with ideas for further work in the field.
-
" Adaptive Correctness Monitoring for Wireless Sensor Networks Using Hierarchical Distributed Run-Time Invariant Checking": Douglas Herbert, Vinaitheerthan Sundaram, Yung-Hsiang Lu, Saurabh Bagchi, and Zhiyuan Li. In: ACM Transactions on Autonomous and Adaptive Systems (TAAS), notification of acceptance: May 2007. [ abstract ]
This paper presents a hierarchical approach for detecting faults in wireless sensor networks (WSNs) after they have been deployed. The developers of WSNs can specify "invariants" that must be satisfied by the WSNs. We present a framework, Hierarchical SEnsor Network Debugging (H-SEND), for lightweight checking of invariants. H-SEND is able to detect a large class of faults in data-gathering WSNs and leverages the existing message flow in the network by buffering and piggybacking messages. H-SEND checks as close to the source of a fault as possible, pinpointing the fault quickly and efficiently in terms of additional network traffic. Therefore, H-SEND is suited to networks constrained in bandwidth or communication energy. A specification expression is provided so that a protocol developer can write behavioral-level invariants.
We hypothesize that data from sensor nodes does not change dramatically, but rather changes gradually over time. We extend our framework to invariants that include values determined at run time, in order to detect violations of data trends. The value range can be based on information local to a single node or on the surrounding nodes' values. Using our system, developers can write invariants to detect data trends without prior knowledge of correct values. Automatic value detection can be used to detect anomalies that were not previously detectable. To demonstrate the benefits of run-time range detection and fault checking, we construct a prototype WSN using CO2 and temperature sensors coupled to Mica2 motes. We show that our method can detect sudden changes in the environment with little overhead in communication, computation, and storage.
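A run-time range invariant of the kind described can be sketched as follows (the window size, slack factor, and names are illustrative, not taken from H-SEND):

    from collections import deque

    def make_trend_invariant(window=10, slack=3.0):
        # Flag a reading that jumps outside a range learned from recent history.
        history = deque(maxlen=window)
        def check(value):
            if len(history) >= window:
                mean = sum(history) / len(history)
                spread = (max(history) - min(history)) or 1.0
                violated = abs(value - mean) > slack * spread
            else:
                violated = False                   # not enough data yet
            history.append(value)
            return violated
        return check

    check_co2 = make_trend_invariant()             # one invariant per sensed quantity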
-
" Automated Rule-Based Diagnosis through a Distributed Monitor System": Gunjan Khanna, Mike Yu Cheng, Padma Varadharajan, Saurabh Bagchi, Miguel P. Correia, and Paulo J. Verissimo. In: IEEE Transactions on Dependable and Secure Computing (TDSC), notificacion of acceptance: May 2007. [ abstract ]
In today's world where distributed systems form many of our critical infrastructures, dependability outages are becoming increasingly common. In many situations, it is necessary to not just detect a failure, but also to diagnose the failure, i.e., to identify the source of the failure. Diagnosis is challenging since high throughput applications with frequent interactions between the different components allow fast error propagation. It is desirable to consider applications as black-boxes for the diagnostic process. In this paper, we propose a Monitor architecture for diagnosing failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal graph between the PEs based on their communication and uses this together with a rule base of allowed state transition paths to diagnose the failure. The tests used for the diagnosis are based on the rule base and are assumed to have imperfect coverage. The hierarchical Monitor framework allows distributed diagnosis handling failures at individual Monitors. The framework is implemented and applied to a reliable multicast protocol executing on our campus-wide network. Fault injection experiments are carried out to evaluate the accuracy and latency of the diagnosis.
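The diagnostic walk over the causal graph can be rendered roughly as follows (the data structures and test interface are assumptions for illustration, not the Monitor's implementation):

    def diagnose(causal_parents, failing_pe, rule_tests):
        # Walk the causal graph backwards from the protocol entity (PE) where
        # the error was detected, applying the rule-based test to each ancestor;
        # PEs whose tests fail are the suspected origins of the error.
        suspects, frontier, seen = [], [failing_pe], set()
        while frontier:
            pe = frontier.pop()
            if pe in seen:
                continue
            seen.add(pe)
            if not rule_tests[pe]():               # tests have imperfect coverage
                suspects.append(pe)
            frontier.extend(causal_parents.get(pe, []))
        return suspects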
" Stream: Low Overhead Wireless Reprogramming for Sensor Networks", Rajesh Krishna Panta, Issa Khalil, and Saurabh Bagchi. 26th Annual IEEE Conference on Computer Communications (INFOCOM), May 6-12 2007, Anchorage, Alaska, USA. (Acceptance rate: 252/~1400 = 18%) [ Presentation ] [ abstract ]
Wireless reprogramming of a sensor network is useful for uploading new code or for changing the functionality of existing code. Throughout the process, a node should remain
receptive to future code updates because reprogramming may be done multiple times during the node’s lifetime. Existing reprogramming protocols, such as Deluge, achieve this by
bundling the reprogramming protocol and the application as one program image, thereby increasing the overall size of the image which is transferred through the network. This increases both time and energy required for network reprogramming. We present a protocol called Stream that mitigates the problem by significantly reducing the size of the program image. Using the facility of having multiple code images on a node and switching between them, Stream pre-installs the reprogramming protocol as one image and the application program equipped with the ability to listen to new code updates as the second image. For a sample application, Stream reduces the size of the program image by 10
pages (48 packets/page) compared to Deluge. Stream is implemented on the Mica2 sensor nodes and we conduct testbed and simulation experiments to show the reduction in energy and reprogramming time of Stream compared to Deluge.
" Data-Centric Routing in Sensor Networks: Single-hop Broadcast or Multi-hop Unicast", Xuan Zhong, Ravish Khosla, Gunjan Khanna , Saurabh Bagchi and Edward J. Coyle. IEEE 65th Vehicular Technology Conference (VTC2007-Spring), April 22 - 25 2007. (Acceptance rate: 685/1443 ~ 47.4%) [ abstract ]
Data dissemination strategies and communication protocols that minimize the use of energy can significantly prolong the lifetime of a sensor network. Data-centric dissemination strategies seek energy efficiency by employing short metadata descriptions in advertisements (ADVs) of the availability of data, short requests (REQs) to obtain the data by nodes that are interested in it, and data transmissions (DATA) to deliver data to the requesting nodes. An important decision in this process is whether the DATA transmission should be made at full power in broadcast mode or at low power in multi-hop unicast mode. The determining factor is shown in this paper to be the fraction of
nodes that are interested in the DATA, as shown by the number of REQs that are generated. Closed-form expressions for this critical fraction of interested nodes are derived for the cases in which the nodes have no memory or infinite memory for state information, and in which transmissions are reliable or unreliable. These results can be used during both the design and operation of the network to increase energy efficiency and network longevity.
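A back-of-the-envelope version of the trade-off, using assumed cost parameters rather than the paper's model: if a full-power broadcast costs $E_b$, one low-power hop costs $E_u$, the average interested node is $\bar{h}$ hops away, and a fraction $f$ of the $N$ nodes issue REQs, then multi-hop unicast costs roughly $f N \bar{h} E_u$, so broadcast becomes preferable when

    \[ f > f^{*} = \frac{E_b}{N\,\bar{h}\,E_u}. \]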
" Analysis and Evaluation of SECOS, a Protocol for Energy Efficient and Secure Communication in Sensor Networks”, Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. Elsevier Ad-Hoc Networks Journal, Volume 5, Issue 3, pp. 360-391, April 2007.
[ abstract ]
Wireless sensor networks are increasingly being used in applications where the communication between nodes needs to be protected from eavesdropping and tampering. Such protection is typically provided using techniques from symmetric key cryptography. The protocols in this domain suffer from one or more of the following problems: weak security guarantees if some nodes are compromised, lack of scalability, high energy overhead for key management, and increased end-to-end data latency. In this paper, we propose a protocol called SECOS that mitigates these problems in static sensor networks. SECOS divides the sensor field into control groups, each with a control node. Data exchange between nodes within a control group happens through the mediation of the control head, which provides the common key. The keys are refreshed periodically and the control nodes are changed periodically to enhance security. SECOS enhances the survivability of the network by handling compromise and failures of control nodes. It provides the guarantee that the communication between any two sensor nodes remains secure despite the compromise of any number of other nodes in the network. The experiments based on a simulation model show a seven-fold reduction in energy overhead and a 50% reduction in latency compared to SPINS, which is one of the state-of-the-art protocols for key management in sensor networks.
" Automated Adaptive Intrusion Containment in Systems of Interacting Services", Yu-Sung Wu, Bingrui Foo, Yu-ChunMao, Saurabh Bagchi, Eugene Spafford. Elsevier Journal of Computer Networks, Volume 51, Issue 5, pp. 1334-1360, April 2007.
[ abstract ]
Large-scale distributed systems typically have interactions among different services that create an avenue for propagation of a failure from one service to another. The failures being considered may be the result of natural failures or malicious activity, collectively called disruptions. To make these systems tolerant to failures, it is necessary to contain the spread of a disruption automatically once it is detected. The objective is to allow certain parts of the system to continue to provide partial functionality in the face of failures. Real-world situations impose several constraints on the design of such a disruption-tolerant system, of which we consider the following: the alarms may have type I or type II errors; it may not be possible to change a service itself even though its interactions may be changed; attacks may use steps that are not anticipated a priori; and there may be bursts of concurrent alarms. We present the design and implementation of a system named ADEPTS as the realization of such a disruption-tolerant system. ADEPTS uses a directed graph representation to model the spread of the failure through the system, provides algorithms for determining appropriate responses and monitoring their effectiveness, and quantifies the effect of disruptions through a high-level survivability metric. ADEPTS is demonstrated on a real e-commerce testbed with actual attack patterns injected into it.
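A toy rendering of graph-driven response selection (the scoring rule and field names below are assumptions, not the ADEPTS algorithm):

    def choose_responses(alarmed, downstream, candidates, confidence):
        # For each alarmed service and the services reachable from it in the
        # spread graph, pick the response with the best effectiveness-minus-cost
        # score, weighted by confidence in the alarm (type I/II errors allowed).
        chosen = {}
        for node in alarmed:
            for svc in [node] + downstream.get(node, []):
                best = max(candidates[svc],
                           key=lambda r: confidence[node] * r["effect"] - r["cost"])
                chosen[svc] = best["name"]
        return chosen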
" Performance Comparison of SPIN based Push-Pull Protocols", Ravish Khosla, Xuan Zhong, Gunjan Khanna, Saurabh Bagchi,and Edward J. Coyle. IEEE Wireless Communications and Networking Conference (WCNC), Mar 11-15, 2007, Hong Kong. (Acceptance rate: 48%) [ abstract ]
Multiple data-centric protocols, which can broadly be classified as push-pull, push-only, or pull-only, have been proposed in the literature. In this paper we present a framework to develop insight into the characteristics of push-pull protocols. The performance of push-pull protocols is critically dependent on the timeout settings used to trigger failure recovery mechanisms. We study how to choose optimal timeouts to achieve the best performance. Our starting point is a recently proposed SPIN-based protocol, called Shortest-Path Minded SPIN (SPMS), in which meta-data negotiations take place prior to data exchange in order to minimize the number of data transmissions. We propose a redesign of SPMS, called SPMS-Rec, which reduces the energy expended in the event of failures by requiring intermediate relay nodes to try alternate routes. Our simulation results show that SPMS-Rec outperforms SPMS, and thus SPIN, yielding energy savings while reducing delay when multiple nodes fail along a route. We further propose a modification to SPMS-Rec through request suppression, which helps reduce redundant data transmissions.
" Prediction of Resource Availability in Fine-Grained Cycle Sharing Systems and Empirical Evaluation", Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. In Springer’s Journal of Grid Computing (JOGC), vol. 5, no. 2, pp. 173-195, 2007. [ abstract ]
Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of computational resources available on the Internet. In FGCS, host computers allow guest jobs to utilize the CPU cycles if the jobs do not significantly impact the local users. Such resources are generally provided voluntarily and their availability fluctuates highly. Guest jobs may fail unexpectedly, as resources become unavailable. To improve this situation, we consider methods to predict resource availability. This paper presents empirical studies on resource availability in FGCS systems and a prediction method. From studies on resource contention among guest jobs and local users, we derive a multi-state availability model. The model enables us to detect resource unavailability in a non-intrusive way. We analyzed the traces collected from a production FGCS system for three months. The results suggest the feasibility of predicting resource availability, and motivate our method of applying semi-Markov Process models for the prediction. We describe the prediction framework and its implementation in a production FGCS system, named iShare. Through the experiments on an iShare testbed, we demonstrate that the prediction achieves an accuracy of 86% on average and outperforms linear time series models, while the computational cost is negligible. Our experimental results also show that the prediction is robust in the presence of irregular resource availability. We tested the effectiveness of the prediction in a proactive scheduler. Initial results show that applying availability prediction to job scheduling reduces the number of jobs failed due to resource unavailability.
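A minimal numerical sketch of multi-state availability prediction (a discrete-time Markov approximation of the semi-Markov approach; the states and transition probabilities are illustrative, not measured values from iShare):

    import numpy as np

    # States: 0 = fully available, 1 = partly available (local user active),
    # 2 = unavailable (a guest job would be evicted).
    P = np.array([[0.90, 0.08, 0.02],
                  [0.20, 0.70, 0.10],
                  [0.30, 0.20, 0.50]])

    def p_no_eviction(start_state, steps):
        # Probability of never entering the unavailable state within `steps`
        # intervals: make state 2 absorbing, then take matrix powers.
        Q = P.copy()
        Q[2] = [0.0, 0.0, 1.0]
        return 1.0 - np.linalg.matrix_power(Q, steps)[start_state, 2]

    print(round(p_no_eviction(0, 12), 3))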
" MOBIWORP: Mitigation of the Wormhole Attack in Mobile Multihop Wireless Networks", Issa Khalil, Saurabh Bagchi, and Ness B. Shroff. Elsevier Ad hoc Networks Journal, Volume 6, Issue 3, pp. 344-362, May 2008.
[ abstract ]
In multihop wireless systems, the need for cooperation among nodes to relay each other's packets exposes them to a wide range of security attacks. A particularly devastating attack is the wormhole attack, where a malicious node records control traffic at one location and tunnels it to a colluding node, possibly far away, which replays it locally. This can have an adverse effect on route establishment by preventing nodes from discovering legitimate routes that are more than two hops away. Previous works on tolerating wormhole attacks have focused only on detection and used specialized hardware, such as directional antennas or extremely accurate clocks. More recent work has addressed the problem of locally isolating the malicious nodes. However, all of this work has been done in the context of static networks due to the difficulty of secure neighbor discovery with mobile nodes. The existing work on secure neighbor discovery has limitations in accuracy, resource requirements, and applicability to ad hoc and sensor networks. In this paper, we present a countermeasure for the wormhole attack, called MOBIWORP, which alleviates these drawbacks and efficiently mitigates the wormhole attack in mobile networks. MOBIWORP uses a secure central authority (CA) for global tracking of node positions. Local monitoring is used to detect and isolate malicious nodes locally. Additionally, when sufficient suspicion builds up at the CA, it enforces a global isolation of the malicious node from the whole network. The effect of MOBIWORP on the data traffic and the fidelity of detection is brought out through extensive simulation using ns-2. The results show that as time progresses, the data packet drop ratio goes to zero with MOBIWORP, due to the capability of MOBIWORP to detect, diagnose, and isolate malicious nodes. With an appropriate choice of design parameters, MOBIWORP is shown to completely eliminate framing of a legitimate node by malicious nodes, at the cost of a slight increase in the drop ratio. The results also show that increasing mobility of the nodes degrades the performance of MOBIWORP.
|
|
2006 |
-
Issa Khalil, "Mitigation of Control and Data Traffic Attacks in Wireless Ad-Hoc and Sensor Networks". Defended: December 14, 2006. [ Abstract ] [ Full thesis ]
-
" Timing Channel Capacity for Uniform and Gaussian Servers", Sarah Sellke, Ness B. Shroff, Saurabh Bagchi, and Chih-Chun Wang. Forty-Fourth Annual Allerton Conference On Communication, Control, and Computing, Sep 27-29, 2006, Allerton, IL, USA.
-
" SPACEDIVE: A Distributed Intrusion Detection System for Voice-over-IP Environments", Vinita Apte, Yu-Sung Wu, Saurabh Bagchi, Sachin Garg, and Navjot Singh. Fast Abstract in the Supplemental Proceedings of the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2006, Philadelphia, Pennsylvania, USA. [ Presentation ]
-
" Providing Automated Detection of Problems in Virtualized Servers using Monitor framework", Gunjan Khanna, Saurabh Bagchi, Kirk Beaty, Andrzej Kochut, and Gautam Kar. Workshop on Applied Software Reliability (WASR) at the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2006, Philadelphia, Pennsylvania, USA. [ Presentation ]
-
" Detection and Repair of Software Errors in Hierarchical Sensor Networks", Douglas Herbert, Yung-Hsiang Lu, Saurabh Bagchi, and Zhiyuan Li. IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing (SUTC2006), June 5-7, 2006, Taichung, Taiwan. (Acceptance rate: 50/210 ~ 23.8%)
- "Automated Online Monitoring of Distributed Applications through External Monitors", Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 3, no. 2, pp. 115-129, Apr-Jun, 2006.
|
|
2005 |
- "LRRM: A Randomized Reliable Multicast Protocol for Optimizing Recovery Latency and Buffer Utilization", Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi. 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), October 26-28, 2005, Orlando, Florida, USA.(Acceptance rate: 20/67 ~ 29.9%) [ Camera ready ].
- "Automated Monitor Based Diagnosis in Distributed Systems", Gunjan Khanna, Padma Varadharajan, Mike Cheng, and Saurabh Bagchi, Purdue ECE Technical Report 05-13, August 2005.
- "ADEPTS: Adaptive Intrusion Response using Attack Graphs in an E-Commerce Environment", Bingrui Foo , Yu-Sung Wu, Yu-Chun Mao, Saurabh Bagchi, and Eugene Spafford. International Conference on Dependable Systems and Networks (DSN), June 28- July 1, 2005, Yokohama, Japan. (Acceptance rate: DCCS track 54/204 ~ 26.8%) [ Camera ready ] [ Presentation ]
-
" Location Estimation in Ad-Hoc Networks with Directional Antenna", Nipoon Malhotra, Mark Krasniewski, Chin-Lung Yang, Saurabh Bagchi, and William Chappell. 25th International Conference on Distributed Computing Systems (ICDCS), June 6-9, 2005, Columbus, Ohio, USA. (Acceptance rate:<14% of 540) [ Presentation ]
|
|
2004 |
- "Self Checking Network Protocols: A Monitor Based Approach", Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. 23rd International Symposium on Reliable Distributed Systems (SRDS 2004), October 2004. (Acceptance rate:27/117 ~ 23.1%)
[ Camera Ready ] [ Presentation ]
- "SCIDIVE: A Stateful and Cross Protocol Intrusion Detection Architecture for Voice-over-IP Environments", Saurabh Bagchi, Yu-Sung Wu (Purdue U. , USA); Sachin Garg, Navjot Singh, Tim Tsai (Avaya Labs, USA). IEEE Dependable Systems and Networks Conference (DSN 2004), June28-July 1, 2004, Florence, Italy. (Acceptance rate: DCCS track 58/276 ~ 21%).
[ Camera ready ] [ Presentation ]
- "Controlled Mobility for Efficient Data Gathering in Sensor Networks with Passively Mobile Nodes", Yuldi Tirta, Bennett Lau, Nipoon Malhotra, Saurabh Bagchi, Zhiyuan Li, and Yung-Hsiang Lu. Sensor Network Operations, Shashi Phoha (ed.), IEEE Press, Wiley Publications, 2004. (Acceptance rate: 30/90 ~ 33.3%)
- "Failure Handling in a Reliable Multicast Protocol for Improving Buffer Utilization and Accommodating Heterogeneous Receivers", Gunjan Khanna, John Rogers, and Saurabh Bagchi. In Proceedings of the 10th IEEE Pacific Rim Dependable Computing Conference (PRDC' 04), March 2004. (Acceptance rate: 34/102 ~ 33.3%) [ Camera ready ]
|
|
2003 |
|
|
Copyright notice: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the appropriate publisher (IEEE, ACM, Elsevier, etc.) |