October 2022

Our work with Adobe Research on root cause diagnosis of failures in cloud applications has been accepted to appear at NeurIPS. The work was ably led by DCSL PhD student Azam Ikram, in his first year of graduate work, in close cooperation with Sarthak Chakraborty, Research Associate at Adobe Research. The team also includes Murat Kocaoglu, a causal theory expert and fellow faculty member at Purdue, and Adobe researchers Shiv Saini and Subrata Mitra.

Muhammad Azam Ikram; Sarthak Chakraborty, Subrata Mitra, Shiv Saini (Adobe Research); Saurabh Bagchi, and Murat Kocaoglu, “Root Cause Analysis of Failures in Microservices through Causal Discovery,” Accepted to appear at the 36th Conference on Neural Information Processing Systems (NeurIPS), pp. 1-13 (supplementary pp. 1-7), 2022.

Acceptance rate: 2,665/10,411 = 25.6%
Azam Ikram, PhD student and lead author of this study

Root Cause Analysis (RCA) is widely used to ensure the reliability of production systems, including cloud-based systems. Root Cause Analysis techniques based on causal structure discovery have recently been used to find the root cause(s) of a fault in cloud applications. The goal is to construct a graph whose nodes are metrics and whose directed edges show the direction and magnitude of the causal effect between two metrics. Directly applying causal discovery algorithms like PC [47] is infeasible for a microservice system because of the large number of metrics. Existing approaches reduce the number of nodes by applying feature selection or by choosing a specific set of metrics. An obvious problem with these approaches is that the selected set of metrics might not include the root cause metric(s). In addition, the feature selection step might introduce latent variables, which would render the popularly used PC algorithm inapplicable. Furthermore, most of these approaches rely only on observational data and therefore do not exploit the potential invariance present in interventional data to learn the underlying causal structure.

We propose a scalable algorithm for rapidly detecting the root cause of failures in complex microservice architectures. We demonstrate this on a synthetic microservice-based application that we stand up and stress with various kinds of requests. We also demonstrate the feasibility of our approach on real production failures from an Adobe cloud-based service.
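
To make the idea of using interventional data concrete, here is a minimal sketch in Python. It is not the algorithm from the paper. It assumes a small causal graph that is already known (a hypothetical chain cpu -> latency -> error_rate), simulated linear metrics, and a failure modeled as a shift injected into one metric. All names, coefficients, and the scoring rule are illustrative assumptions. The sketch fits each metric's dependence on its parent using normal data, then checks which metric's fitted relationship stops holding in failure data. Metrics downstream of the fault also shift in absolute terms, but their relationship to their parent stays invariant, so only the faulty metric should stand out.

```python
import random
import statistics

# Hypothetical, already-known causal chain (an assumption for illustration):
# cpu -> latency -> error_rate. Each metric has at most one parent here.
PARENT = {"cpu": None, "latency": "cpu", "error_rate": "latency"}


def simulate(n, latency_shift=0.0, seed=0):
    """Generate n samples; latency_shift models a fault injected at latency."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cpu = rng.gauss(0, 1)
        latency = 2.0 * cpu + latency_shift + rng.gauss(0, 1)
        error_rate = 3.0 * latency + rng.gauss(0, 1)
        rows.append({"cpu": cpu, "latency": latency, "error_rate": error_rate})
    return rows


def fit(rows, node):
    """Least-squares fit of node on its parent; returns (slope, intercept)."""
    ys = [r[node] for r in rows]
    parent = PARENT[node]
    if parent is None:
        return 0.0, statistics.fmean(ys)  # root metric: model is just its mean
    xs = [r[parent] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(rows, node, model):
    """Difference between observed values and the fitted parent-based prediction."""
    slope, intercept = model
    parent = PARENT[node]
    return [
        r[node] - (slope * (r[parent] if parent else 0.0) + intercept)
        for r in rows
    ]


def mechanism_shift_scores(normal, failure):
    """Score each metric by how far its failure-time residual mean drifts from zero."""
    scores = {}
    for node in PARENT:
        model = fit(normal, node)  # relationship learned from normal data only
        base = residuals(normal, node, model)
        anomalous = residuals(failure, node, model)
        stderr = statistics.stdev(base) / len(anomalous) ** 0.5
        scores[node] = abs(statistics.fmean(anomalous)) / stderr
    return scores


normal = simulate(2000, seed=1)
failure = simulate(500, latency_shift=5.0, seed=2)
scores = mechanism_shift_scores(normal, failure)
root_cause = max(scores, key=scores.get)
print(root_cause, {k: round(v, 1) for k, v in scores.items()})
```

In this simulation the mean of error_rate moves much more than the mean of latency, so ranking metrics by raw deviation would point at the wrong metric. Scoring each metric conditional on its parent is what isolates the injected fault. The sketch sidesteps the hard part that the paper addresses, namely learning the relevant causal structure at scale when the graph is not known in advance.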