Root Cause Analysis of Failures in Microservices through Causal Discovery

Abstract

Most cloud applications are composed of a large number of small sub-components (called microservices) that interact with each other in a complex graph to provide the overall functionality to the user. While the modularity of the microservice architecture is beneficial for rapid software development, quickly maintaining and debugging such a system after a failure is challenging. We propose a scalable algorithm for rapidly detecting the root cause of failures in complex microservice architectures. The key ideas behind our novel hierarchical and localized learning approach are: (1) to treat the failure as an intervention on the root cause in order to detect it quickly, (2) to learn only the portion of the causal graph related to the root cause, thus avoiding a large number of costly conditional independence tests, and (3) to explore the graph hierarchically. The proposed technique is highly scalable and produces useful insights about the root cause, whereas traditional techniques become infeasible due to their high computation time. Our solution is application agnostic and relies only on the data collected for diagnosis. For the evaluation, we compare the proposed solution with a modified version of the PC algorithm and with state-of-the-art techniques for root cause analysis. The results show a significant improvement in top-k recall together with a substantial reduction in execution time.
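The three ideas in the abstract can be illustrated with a small sketch. This is not the implementation presented in the talk; it is a simplified illustration under stated assumptions. The failure is encoded as a binary indicator F (0 for normal operation, 1 for the failure period), a thresholded partial correlation stands in for a proper conditional independence test, conditioning sets contain at most one metric, and the service names (ads, cart, checkout, frontend) and all function names are hypothetical.

```python
import math
import random


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def localized_search(data, names, f, threshold=0.15):
    """Keep the metrics in `names` that stay dependent on the failure
    indicator `f`, both marginally and given any single other metric
    in the same subset (assumption: order-1 conditioning sets only)."""
    survivors = []
    for x in names:
        if abs(corr(data[x], f)) < threshold:
            continue  # not affected by the failure at all
        explained = any(
            abs(partial_corr(data[x], f, data[z])) < threshold
            for z in names
            if z != x
        )
        if not explained:
            survivors.append(x)
    return survivors


def hierarchical_rca(data, f, chunk_size=2, threshold=0.15):
    """Run the localized search on small chunks of metrics, then repeat
    on the union of the survivors until one chunk can hold them all."""
    candidates = sorted(data)
    while len(candidates) > chunk_size:
        chunks = [
            candidates[i:i + chunk_size]
            for i in range(0, len(candidates), chunk_size)
        ]
        kept = [x for c in chunks for x in localized_search(data, c, f, threshold)]
        if len(kept) == len(candidates):
            break  # no progress; fall through to one joint pass
        candidates = kept
    return localized_search(data, candidates, f, threshold)


def simulate(n=1000, seed=0):
    """Synthetic metrics: the failure shifts `cart`, which propagates to
    `checkout` and then `frontend`; `ads` is unaffected."""
    rng = random.Random(seed)
    f = [0] * n + [1] * n
    data = {name: [] for name in ("ads", "cart", "checkout", "frontend")}
    for flag in f:
        cart = rng.gauss(0, 1) + 3.0 * flag
        checkout = 0.8 * cart + rng.gauss(0, 1)
        frontend = 0.8 * checkout + rng.gauss(0, 1)
        data["ads"].append(rng.gauss(0, 1))
        data["cart"].append(cart)
        data["checkout"].append(checkout)
        data["frontend"].append(frontend)
    return data, f


if __name__ == "__main__":
    data, f = simulate()
    print(hierarchical_rca(data, f))
```

In this synthetic setup, checkout survives the first level because cart sits in a different chunk and cannot be conditioned on; it is eliminated only when the survivors are merged. The sketch never learns the edges among the services themselves, which is the sense in which only the part of the graph around the failure indicator is explored.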

Biography

I am a second-year Ph.D. student in ECE at Purdue University, working with Prof. Saurabh Bagchi. I am broadly interested in large-scale distributed systems, cloud computing, computer networks, and applied machine learning in systems. My current research focuses on providing fault diagnosability for cloud applications through causal inference. Another part of my work concentrates on making serverless frameworks more practical for end users by reducing the latency and cost of serverless DAGs. Before joining Purdue, I was at LUMS, where I worked with Prof. Zafar Ayyub Qazi on reducing the latency of cellular control plane messages in 4G/LTE.

Video of the Talk