Fireside Research Chats: Audio Conversations on research projects with DCSL researchers.
1. Distributed Intrusion Tolerant System Design
As distributed systems are deployed to run critical applications, there is an increasing need to make such systems resilient. The distributed applications running on such platforms need continuous uptime, as downtime translates directly to financial losses, loss of prestige, or endangerment of human lives. Examples of such applications abound in banking, finance, airlines, and the military. The systems need to be resilient both to faults as traditionally understood, which have non-malicious origins, and to intrusions, which are faults created by malicious human attackers. The ultimate outcome of both faults and intrusions is a failure of the system or a degradation of its performance. Until now, however, the two causes of system disruption have been treated as separate areas of concern, each with its own methodology and sub-system. This increases the complexity of the overall system, fails to exploit the synergy between the two tasks (providing fault tolerance and providing intrusion tolerance), and therefore increases the overall cost of achieving both.
In this proposal, we argue for a common approach to handling faults and intrusions, henceforth called disruptions. There is also a need for an integrated infrastructure that provides tolerance to disruptions rather than piecemeal solutions such as detection alone. The phases of disruption tolerance considered here (detection, diagnosis, containment, and recovery) are closely coupled in their functioning, and the project provides a framework for exploring their interactions to make the system as a whole resilient to disruptions. The project uses best-of-breed techniques from several long-researched areas of fault tolerance and intrusion tolerance, such as network-based intrusion detection tools and available databases of vulnerabilities in widely used software.
The current problems we are working on are:
- Providing intrusion response to unanticipated attacks
- Using history of multi-stage attacks to optimize responses to future attacks
- Developing a distributed intrusion detection architecture for p2p VoIP systems
Eugene Spafford (CS, Purdue), Mikhail Atallah (CS, Purdue), Mike Reiter (UNC), Guy Lebanon (Georgia Tech).
Gaspar Modelo-Howard, Abhisek Pan, Amiya Maji.
Yu-Sung Wu, Bingrui Foo, Matthew Glause.
Details: Click for a more detailed discussion of the rationale and our solution approach.
Papers: See publications.
2. Intelligent Ad-Hoc Wireless Networks
Recent advances in wireless communications and electronics have enabled the development of low-cost, low-power, miniature sensor nodes. These nodes are multifunctional and capable of sensing, communication, computation and, sometimes, mobility. Sensor networks consist of large numbers of sensor nodes placed in the environment to be monitored and communicating with each other through low-bandwidth communication links. Some characteristics of the sensor nodes make it challenging to harness them into usable networks: they have limited sources of power, high failure rates, and limited computation and communication power. In our current research, we are investigating the issues in building sensor networks that meet high-level application requirements, such as time to completion, in the face of the constraints imposed by the nodes. Specifically, our research focuses on scheduling algorithms for communication, computation and mobility that take deployment-specific knowledge into account. While the individual nodes may be prone to failures, a certain fraction of redundant nodes may be deployed. We intend to investigate protocols that exploit the redundancy in the sensor nodes to "route around" failures. Another area of focus is the design of power-aware routing protocols and their integration with higher-level protocols controlling computation, communication and motion. In this research, we will address fundamental problems leading to proof-of-concept demonstrations of the developed strategies and technologies through a testbed of mobile sensor nodes running a network protocol stack, middleware and representative applications.
In another facet of this work, we are looking at protecting sensor networks from control and data attacks that do not need cryptographic keys. A particularly devastating control attack is known as the wormhole attack, where a malicious node records control and data traffic at one location and tunnels it to a colluding node far away, which replays it locally. This can either disrupt route establishment or make routes pass through the malicious nodes. We have developed a lightweight countermeasure for the wormhole attack, called LITEWORP, which relies on overhearing neighbor communication. It also isolates the malicious node thus preventing it from launching further attacks.
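The overhearing idea described above can be illustrated with a small sketch. This is not the LITEWORP protocol itself, only a simplified model of a guard node that overhears both ends of a link; the class name `Guard`, the `timeout` and `threshold` values, and the accusation counters are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Guard:
    """Hypothetical guard node that can overhear both ends of a link."""
    timeout: float = 1.0   # assumed: seconds a forwarder has to relay a packet
    threshold: int = 2     # assumed: accusations needed before flagging a node
    pending: dict = field(default_factory=dict)      # (packet_id, forwarder) -> deadline
    misbehavior: dict = field(default_factory=dict)  # node -> accusation count

    def overhear_send(self, packet_id, forwarder, now):
        """A neighbor handed packet_id to forwarder; expect a relay soon."""
        self.pending[(packet_id, forwarder)] = now + self.timeout

    def overhear_forward(self, packet_id, forwarder, now):
        """Forwarder relayed packet_id; check that it was legitimately received."""
        deadline = self.pending.pop((packet_id, forwarder), None)
        if deadline is None or now > deadline:
            # Either the packet was never seen arriving over the monitored
            # link (possibly fabricated or tunneled in) or it was relayed late.
            self._accuse(forwarder)

    def expire(self, now):
        """Accuse forwarders whose pending packets were never relayed."""
        for key, deadline in list(self.pending.items()):
            if now > deadline:
                del self.pending[key]
                self._accuse(key[1])

    def _accuse(self, node):
        self.misbehavior[node] = self.misbehavior.get(node, 0) + 1

    def suspects(self):
        """Nodes whose accusation count has reached the threshold."""
        return {n for n, c in self.misbehavior.items() if c >= self.threshold}
```

In this sketch, a node that relays packets the guard never saw it receive accumulates accusations and eventually appears in `suspects()`, at which point its neighbors could stop accepting traffic from it.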
Another important area of work is wireless reprogramming of sensor networks. As sensor networks operate over long periods of deployment in difficult-to-reach places, their requirements may change or new code may need to be uploaded to them. The current state-of-the-art protocols for network reprogramming (Deluge and MNP) disseminate code in a multi-hop manner using a three-way handshake, in which meta-data is exchanged prior to code exchange to suppress redundant transmissions. The code image is also pipelined through the network at the granularity of pages. In our work, we demonstrate a protocol called Freshet that optimizes the energy spent on code upload and speeds up dissemination when multiple sources of code are available. The energy optimization is achieved by equipping each node with limited non-local topology information, which it uses to determine when it can go to sleep because code is not being distributed in its vicinity. We are also building a framework and a system for debugging large-scale sensor network applications. It relies on low-overhead invariant checking, where the invariant may be local to a node or may need network communication and aggregation.
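The three-way handshake at page granularity can be sketched roughly as follows. This is a toy model, not the Deluge, MNP, or Freshet message format: the `Node` class, its fields, and the assumption that pages arrive strictly in order are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical sensor node holding some pages of a code image."""
    name: str
    version: int = 0
    pages: int = 0  # complete pages held; assumed to arrive in order

    def advertise(self):
        """Step 1: broadcast meta-data only (version, pages held)."""
        return (self.version, self.pages)

    def request_for(self, adv):
        """Step 2: return the page index to request, or None to stay quiet."""
        adv_version, adv_pages = adv
        if adv_version > self.version:
            return 0  # newer image: start over from the first page
        if adv_version == self.version and adv_pages > self.pages:
            return self.pages  # same image: ask for the next missing page
        return None  # nothing new: suppress the request

    def receive_page(self, version, index):
        """Step 3: store a page delivered in response to a request."""
        if version > self.version:
            self.version, self.pages = version, 0
        if version == self.version and index == self.pages:
            self.pages += 1


def disseminate(source, receiver):
    """Repeat the handshake until the receiver has nothing left to request."""
    transfers = 0
    while (index := receiver.request_for(source.advertise())) is not None:
        receiver.receive_page(source.version, index)
        transfers += 1
    return transfers
```

Because a node advertises the pages it already holds, an intermediate node in this model can start serving page 0 to its own neighbors before it has received the full image, which is the pipelining effect described above.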
The current problems we are working on are:
- Fast code upload to large scale sensor networks
- Mitigating control attacks in mobile networks that have sleep-awake schedules
- Building distributed trust in sensor networks based on data and control behavior
- Communication of error information for quick debugging of large scale sensor networks
Yung-Hsiang Lu (Purdue), Zhiyuan Li (CS, Purdue), Chih-Chun Wang (Purdue), Xiaojun Lin (Purdue), Luis Montestruque (Emnet LLC).
Issa Khalil, DongHoon Shin, Jinkyu Koo, Matthew Tan Creti, Madalina Vinitila.
Ness B. Shroff (Ohio State), Jeff Bao (Motorola Labs).
Rajesh Panta (PhD), Vinai Sundaram, Carlos Perez (MS).
Papers: See publications.
3. Self-Checking Techniques for Distributed Protocols
Networked devices running software service protocols are playing an increasingly important role in the connected world of today. Given the increase in complexity and scale of these services, and the fact that networks are primarily running on software architectures designed in the 1980s, we are seeing cases of spectacular failures of network services today and believe tomorrow’s services are not positioned for the continuous availability that will be required of them. The approach we are studying for detecting a class of failures due to software defects, mis-configurations or malicious attacks employs an external monitor component. The monitor observes the interactions between the protocol participants and performs predictive or reactive detection of failures. It is important that the monitor be resilient, scalable and non-intrusive to the original software. The coverage of the monitor has to be validated by injecting failures into the system that mimic real-world scenarios.
It is often not enough to detect a failure; it is also necessary to diagnose it, i.e., to identify its source. Diagnosis is challenging because fast error propagation may occur in high-throughput distributed applications. The diagnosis often needs to be probabilistic in nature due to imperfect observability of the payload system, inability to do white-box testing, constraints on the amount of state that can be maintained at the diagnostic process, and imperfect tests used to verify the system. The Monitor architecture is capable of probabilistic diagnosis of failures in large-scale network protocols. The Monitor only observes the message exchanges between the protocol entities (PEs) remotely and does not access internal protocol state. At runtime, it builds a causal and aggregate graph between the PEs based on their communication and uses this together with a rule base for diagnosing the failure. For each suspected PE, the Monitor computes a probability that the error originated in that PE and propagated to the failure detection site. The framework is applied to a test-bed consisting of a reliable multicast protocol executing on the Purdue campus-wide network. Error injection experiments are performed to evaluate the accuracy and the performance overhead of the diagnosis.
Grid reliability: In a related project, we are building reliable protocols for a shared grid-like environment. In this environment, applications can execute on shared compute nodes and be migrated off transparently when they start competing for resources with the processes belonging to the owner of that node. To deal with such migrations and failures, we have a checkpointing solution in which the checkpoints can be stored on shared storage nodes rather than on dedicated nodes. The project is aimed at enabling reliable execution of applications in a fundamentally dynamic and unreliable environment.
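One way to picture the per-PE probability is the following sketch. It does not reproduce the Monitor's actual rule base or computation; it assumes, purely for illustration, that each observed message carries a known error-propagation probability, that each PE has a prior probability of being faulty, and that a suspect is scored by its prior multiplied by the most likely propagation path to the failure site.

```python
def path_probabilities(edges, failure_site):
    """Best-path propagation probability from each PE to failure_site.

    edges maps (sender, receiver) -> assumed probability that an error in
    the sender propagates to the receiver through that message.
    """
    incoming = {}
    for (src, dst), p in edges.items():
        incoming.setdefault(dst, []).append((src, p))

    best = {failure_site: 1.0}
    frontier = [failure_site]
    while frontier:
        node = frontier.pop()
        for src, p in incoming.get(node, []):
            candidate = best[node] * p
            # Only strict improvements are re-explored, so cycles terminate.
            if candidate > best.get(src, 0.0):
                best[src] = candidate
                frontier.append(src)
    return best


def diagnose(edges, error_prior, failure_site):
    """Score each suspect by prior * propagation probability, normalized."""
    reach = path_probabilities(edges, failure_site)
    scores = {pe: error_prior.get(pe, 0.0) * p for pe, p in reach.items()}
    total = sum(scores.values())
    return {pe: s / total for pe, s in scores.items()} if total else scores
```

A usage example with hypothetical PEs `A`, `B`, and `C`, where the failure is detected at `C`:

```python
edges = {("A", "B"): 0.5, ("B", "C"): 0.8, ("A", "C"): 0.3}
ranking = diagnose(edges, {"A": 0.5, "B": 0.1, "C": 0.1}, "C")
print(max(ranking, key=ranking.get))
```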
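As a rough illustration of checkpointing onto shared rather than dedicated storage, the sketch below replicates each checkpoint across several storage nodes that may disappear at any time. It is an in-memory simulation under assumed names (`StorageNode`, `save_checkpoint`, `restore_checkpoint`, a replication factor of 2), not the project's actual system.

```python
import pickle


class StorageNode:
    """Hypothetical shared node that donates storage and may go away."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.blobs = {}  # job_id -> (sequence number, serialized state)


def save_checkpoint(job_id, seq, state, nodes, replicas=2):
    """Write the checkpoint to up to `replicas` currently available nodes."""
    blob = pickle.dumps(state)
    written = 0
    for node in nodes:
        if written == replicas:
            break
        if node.available:
            node.blobs[job_id] = (seq, blob)
            written += 1
    return written


def restore_checkpoint(job_id, nodes):
    """Return the newest state held by any available node, or None."""
    newest = None
    for node in nodes:
        entry = node.blobs.get(job_id) if node.available else None
        if entry and (newest is None or entry[0] > newest[0]):
            newest = entry
    return None if newest is None else pickle.loads(newest[1])
```

The sequence number lets a restarted application ignore stale replicas left on a storage node that was unavailable during a later checkpoint.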
Rudi Eigenmann (Purdue), Sam Midkiff (Purdue), Greg Bronevetsky and Bronis Supinski (Lawrence Livermore National Lab).
Monitor project: Fahad Arshad, Ignacio Laguna, Nawanol Theera-Ampornpunt.
Grid reliability project: Tanzima Zerin Islam.
Miguel P. Correia, Paulo Veríssimo (University of Lisbon, Portugal).
Gunjan Khanna (PhD), Padma Varadharajan (MS). Mike Cheng (MS).
Details: Click here for a more detailed discussion of the rationale and our solution approach for the Monitor project.
Click here for a more detailed discussion of the rationale and our solution approach for the Grid reliability project.
Click here for a detailed description of our project on characterization of failures in web services.
Papers: See publications.