Resilient and Adaptive Cyberinfrastructures

Distributed systems and applications are becoming increasingly pervasive, providing the core infrastructure for the largest commercial and scientific applications. The complexity and scale of these applications increase continuously as they span more software components, parallel tasks, and computing nodes. For example, large-scale applications running in today’s data centers and supercomputers span thousands of computing nodes with multiple cores per node. As complexity and scale grow, it becomes increasingly difficult to detect errors, performance anomalies, and unexpected behavior in these applications.

Figure: Predictive Reliability Engine for Cellular Networks

Our work develops efficient, distributed techniques to detect errors in such systems, including errors caused by malicious attacks; localize the errors; contain them before they spread through the entire interconnected system; and finally recover the system to an operational state. Our domain areas include networks of embedded devices, scientific applications running on supercomputing clusters, streaming data analytics applications, and computational genomics applications.

One aspect of this thrust deals with improving system operations and architectures for increased resilience. Toward this goal, we are developing foundational theories and optimization tools that exploit real-world system structures to yield computationally efficient, distributed solutions, and we apply them to improve these systems. Recent efforts have been devoted to developing new optimization techniques, based on parallel and distributed processing, for efficient analytics on big data of different kinds. With pervasive sensors continuously collecting massive amounts of information, and with advances in computing, communication, and storage technologies, this is an era of data deluge. However, the sheer volume and increasingly distributed nature of the data, together with the growing complexity of the data models (nonconvex and nonlinear), present major challenges to modern analytics. Our current research aims to address these issues so that the envisioned benefits of big data can be fully realized.

Another aspect of building resilient cyberinfrastructure is making sure that the infrastructure is adaptive. The hardware on which cyberinfrastructure is deployed is becoming increasingly heterogeneous, complex, and unreliable. As a result, software must be able to adapt to different hardware contexts in two ways: it must be resilient to hardware failures, which may require remapping software to new hardware at runtime, and it must adapt to changing hardware availability, which requires that the same software execute effectively on a wide variety of hardware platforms. Our research on this front proceeds along several lines: building programming models that allow developers to write software that transparently adapts to changing hardware resources; developing performance models that allow software to be automatically tuned to complex hardware platforms; and developing optimizations and transformations that allow applications to be automatically transformed to map more effectively to new hardware platforms.

Resources

Resilient and adaptive cyberinfrastructures PowerPoint

Responsible faculty: Milind Kulkarni (lead), Saurabh Bagchi, Gesualdo Scutari, Jitesh Panchal