Dependability in the Large


(1) Seamless application flitting between mobile, edge, and cloud

We are developing foundational methods for an application to “flit” between a mobile device, an edge device, and the cloud. The challenge is to fit within the resource bounds of each platform, meet the application's requirements (such as for latency and accuracy), and handle failures in the platforms or the networks connecting them.
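
To make the constraint matching concrete, here is a minimal sketch of such a placement decision (the tier names, resource figures, and the Requirements type are our own illustrative assumptions, not the project's interface): it picks the lowest-latency healthy platform that satisfies the application's resource and latency bounds, and would be re-run whenever a platform or network link fails.

# Hypothetical sketch of a placement decision for a "flitting" application.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Tier:
    name: str
    cpu_cores: int        # cores available on this platform
    memory_mb: int        # memory available on this platform
    rtt_ms: float         # round-trip latency from the user to this platform
    healthy: bool = True  # set to False when a failure is detected

@dataclass
class Requirements:
    cpu_cores: int
    memory_mb: int
    max_latency_ms: float

def place(app: Requirements, tiers: list[Tier]) -> Optional[Tier]:
    """Pick the lowest-latency healthy tier that satisfies the application's
    resource and latency bounds; callers re-run this when a tier or link fails."""
    for tier in sorted(tiers, key=lambda t: t.rtt_ms):
        if (tier.healthy
                and tier.cpu_cores >= app.cpu_cores
                and tier.memory_mb >= app.memory_mb
                and tier.rtt_ms <= app.max_latency_ms):
            return tier
    return None  # no feasible placement; the caller may relax its requirements

tiers = [Tier("mobile", 4, 2048, 1.0),
         Tier("edge", 16, 16384, 8.0),
         Tier("cloud", 128, 262144, 45.0)]
print(place(Requirements(cpu_cores=8, memory_mb=4096, max_latency_ms=20.0), tiers).name)  # -> edge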

Within this space, we are innovating in providing reliability guarantees while approximating computation. Approximate computing has seen significant achievements of late. It rests on the intuition that not all computation needs to be done exactly: think of a machine learning classifier, humans watching video, or a numerical computation whose floating-point result can afford to be inaccurate beyond the fifth decimal place. Such approximation saves computation time and energy, which are especially valuable for embedded or mobile devices. One challenge, however, has been providing guarantees (even if probabilistic) while making these approximations. Our insight is that such probabilistic guarantees are indeed possible if the approximation is done in a content-aware manner: a highly complex scene cannot be approximated much, while a simple scene can. Our work is developing a suite of techniques, for scientific numerical computation and for streaming applications (such as video streaming), through which the system makes approximations in a content-aware manner while bounding the loss in accuracy.
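
As a minimal illustration of the content-aware idea, the sketch below selects the most aggressive approximation level whose predicted accuracy loss stays under a user-specified bound, so that complex scenes fall back toward exact computation. The complexity proxy, the error model, and the knob values are our own assumptions, not the project's learned models.

# Minimal sketch of content-aware approximation for a streaming pipeline.
import numpy as np

ERROR_BOUND = 0.05                 # user-specified bound on accuracy loss
DOWNSAMPLE_KNOBS = [1, 2, 4, 8]    # candidate approximation levels (1 = exact)

def complexity(frame: np.ndarray) -> float:
    """Cheap proxy for scene complexity: normalized mean gradient magnitude."""
    gy, gx = np.gradient(frame.astype(float))
    return float(np.mean(np.hypot(gx, gy))) / 255.0

def predicted_error(c: float, knob: int) -> float:
    """Assumed error model: loss grows with both complexity and approximation."""
    return c * (1.0 - 1.0 / knob)

def choose_knob(frame: np.ndarray) -> int:
    """Pick the most aggressive approximation whose predicted loss stays under
    the bound; complex frames fall back toward exact computation (knob 1)."""
    c = complexity(frame)
    feasible = [k for k in DOWNSAMPLE_KNOBS if predicted_error(c, k) <= ERROR_BOUND]
    return max(feasible)  # knob 1 is always feasible since its predicted error is 0

frame = np.random.randint(0, 256, size=(480, 640), dtype=np.uint8)
print(choose_knob(frame))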


(2) Reliable updates of distributed systems for changing workloads

A distributed system has many executing components. Each component has an optimal setting for a given workload (such as the number of threads for writing or the amount of memory allocated to the component), and workloads tend to change in a production distributed environment. Further, the environment is heterogeneous and needs to support multiple workloads concurrently. The technical challenge is how to continually update the parameters of the distributed system in the face of changing workloads, with the requirement that data availability and consistency (such as for a distributed database) do not suffer.

The solution methods we are developing can navigate the large parameter search space and automatically identify close-to-optimal configurations for static workloads. For dynamic workloads, they apply a cost-benefit analysis to decide whether reconfiguring the distributed system is worthwhile; this relies on workload prediction to estimate what the workload will look like in the future, and the system reconfigures accordingly. The reconfiguration itself is carried out through a distributed protocol that maintains the user's data availability and consistency requirements.
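
A toy sketch of that cost-benefit test follows; the moving-average forecast and the cost and gain figures are placeholders standing in for the system's actual workload predictor and cost model.

# Illustrative sketch of the cost-benefit test for online reconfiguration.
from statistics import mean

def predict_workload(history: list[float], horizon: int = 10) -> list[float]:
    """Naive forecast: assume the recent average request rate persists."""
    recent = mean(history[-5:])
    return [recent] * horizon

def should_reconfigure(history: list[float],
                       gain_per_request: float,   # throughput gained per request
                                                  # under the candidate configuration
                       reconfig_cost: float) -> bool:
    """Reconfigure only if the predicted benefit over the horizon outweighs
    the one-time cost of migrating to the new configuration."""
    forecast = predict_workload(history)
    expected_benefit = gain_per_request * sum(forecast)
    return expected_benefit > reconfig_cost

history = [1200, 1350, 1500, 1650, 1800]   # requests/sec, trending upward
print(should_reconfigure(history, gain_per_request=0.02, reconfig_cost=200.0))  # True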


(3) Open repository and analysis of system usage data

Dependability has become a necessary property of many of the computer systems that surround us or work behind the scenes to support our personal and professional lives. Heroic progress has been made by computer systems researchers and practitioners working together to build and deploy dependable systems. However, an overwhelming majority of this work is not based on real, publicly available failure data. As a result, results from small lab settings are sometimes disproved years later, many avenues of productive work in dependable system design are closed to most researchers, and, conversely, some unproductive work gets done based on faulty assumptions about the way real systems fail. Unfortunately, no open repository of system usage and failure data exists today for any recent computing infrastructure that is large enough, diverse enough, and carries enough information about the infrastructure and the applications that run on it. We are addressing this pressing need, which has been voiced repeatedly by computer systems researchers from various sub-domains.

The project is collecting, curating, and presenting public failure data of large-scale computing systems in a repository called FRESCO. Our initial sources are Purdue, the University of Illinois at Urbana-Champaign, and the University of Texas at Austin. The data sets comprise static and dynamic information about system usage, the workloads, and failures, for both planned and unplanned outages. We are performing data analytics on these datasets to answer questions such as: (1) How do jobs utilize cluster resources in a centrally managed university cluster? (2) How do users use, or not use, the options to share resources on a node? (3) How often are typical resources (compute, memory, local I/O, remote I/O, networking) overstretched by demand, and does such contention affect the failure rates of jobs? (4) Can users estimate the time their jobs will need on the cluster?
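
For illustration, analyses of this kind over a job-accounting dataset could look like the pandas sketch below; the file name and column names are hypothetical placeholders rather than the repository's actual schema.

# Hypothetical example of the kind of analysis run over job accounting data.
import pandas as pd

# Assumed columns: job_id, cores_requested, cores_used_avg,
# node_shared (bool), contention_level (0-1), failed (bool)
jobs = pd.read_csv("job_accounting.csv")

# (1) How well do jobs utilize the cores they request?
jobs["cpu_utilization"] = jobs["cores_used_avg"] / jobs["cores_requested"]
print("median CPU utilization:", jobs["cpu_utilization"].median())

# (2) How often do users opt into sharing a node?
print("fraction of jobs on shared nodes:", jobs["node_shared"].mean())

# (3) Does resource contention correlate with job failures?
high = jobs[jobs["contention_level"] > 0.8]["failed"].mean()
low = jobs[jobs["contention_level"] <= 0.8]["failed"].mean()
print(f"failure rate under high contention: {high:.2%} vs. low contention: {low:.2%}")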


Last modified: December 10, 2018