Failure Detection and Prediction through Metrics

Overview

Summary

Today’s distributed applications are composed of a large number of hardware and software components. Many of these applications require continuous availability despite being built out of unreliable components. Therefore, system administrators need efficient techniques and practical tools for error detection that can operate online (as the application runs). Preventing an error from becoming a user-visible failure is a desirable capability. Automatically predicting impending failures based on observed patterns of measurements can trigger prevention techniques, such as microrebooting, redirecting further requests to a healthy server, or simply starting a backup service for the data.

Today’s enterprise-class distributed systems routinely collect a plethora of metrics by monitoring at various layers: system level, middleware level, and application level. Many commercial and open-source tools exist for collecting these metrics, such as HP OpenView, Sysstat, and Ganglia. A common class of error-detection techniques works as follows. From the metric values collected during training runs, a model is built of how the metrics should behave during normal operation. At runtime, the behavior indicated by the trained model is compared against the metric values observed in the system. If there is sufficient divergence between the two, an error is flagged.
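As a minimal sketch of this train-then-compare pattern (not Augury's actual model), consider a per-metric baseline built from training data; the metric names, thresholds, and random stand-in data below are purely illustrative:

    import numpy as np

    # Illustrative metric names and stand-in training data (samples x metrics);
    # in a real deployment these would come from a monitoring tool.
    metric_names = ["cpu_util", "heap_used_mb", "open_fds"]
    training = np.random.rand(1000, len(metric_names))

    # Training phase: model "normal" behavior as a per-metric mean and spread.
    mean = training.mean(axis=0)
    std = training.std(axis=0)

    def check(sample, k=3.0):
        """Flag metrics whose current value diverges from the trained model
        by more than k standard deviations."""
        z = np.abs(sample - mean) / (std + 1e-9)
        return [name for name, score in zip(metric_names, z) if score > k]

    # Runtime phase: compare a new measurement vector against the model.
    alerts = check(np.array([0.9, 0.5, 0.97]))
    if alerts:
        print("error flagged on metrics:", alerts)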

Existing approaches toward building error-detection systems based on statistical analysis of runtime metrics suffer from one or more of the following problems:

(a) Their models do not consider relationships between metrics.
(b) Some models use only the current snapshot of measurements.
(c) The overwhelming majority of techniques do not offer failure prediction.
(d) Existing approaches often consider a restricted set of metrics for modeling.

We describe Augury, an error-detection and failure-prediction tool that overcomes these problems by combining the following techniques:

(1) Sequential Multi-Metric Analysis: We address problems (a) and (d) by considering a large set of metrics from the system, middleware, and application levels. Augury uses pairwise correlations between metrics to detect errors. With this approach, even when a metric's value goes outside the range seen during training, we only flag an alarm if the correlation between that metric and another breaks, thus reducing the false-positive rate (see the first sketch after this list).

(2) Failure Prediction: Augury has a predictive operational mode that uses an ARIMA time-series model (created offline using training data from typical workloads) and recent measurements to forecast metric values in the immediate future (see the second sketch after this list). Through this mode, Augury is able to predict impending failures with a longer lookahead time, which is desirable since a prediction is only useful when there is enough time for the recovery mechanism to complete before the failure occurs.
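A minimal sketch of the correlation-based check in (1), using a nearest-neighbor distance over correlation-coefficient vectors; the window size, data shapes, and random stand-in data are illustrative and the actual Augury implementation may differ:

    import numpy as np

    def corr_vector(window):
        """Flatten the upper triangle of the pairwise correlation matrix of a
        (samples x metrics) window into a single correlation-coefficient vector."""
        c = np.corrcoef(window, rowvar=False)
        return c[np.triu_indices_from(c, k=1)]

    # Training: correlation vectors from sliding windows of fault-free runs.
    train_data = np.random.rand(5000, 20)          # 5000 samples of 20 metrics
    W = 100                                        # window size (illustrative)
    train_vectors = np.array([corr_vector(train_data[i:i + W])
                              for i in range(0, len(train_data) - W, W)])

    def detect(window, threshold):
        """Flag an error when the nearest-neighbor distance between the current
        window's correlation vector and the training vectors exceeds a threshold
        calibrated on the training data."""
        nn_dist = np.min(np.linalg.norm(train_vectors - corr_vector(window), axis=1))
        return nn_dist > threshold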
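And a sketch of the predictive mode in (2), here using the statsmodels ARIMA implementation as a stand-in; the candidate orders, series, and forecast horizon are illustrative and Augury's actual model selection may differ:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Offline: fit a few candidate ARIMA orders on one metric's training series
    # and keep the order with the lowest AIC.
    history = np.random.rand(500)                   # stand-in for a collected metric
    candidates = [(1, 0, 0), (1, 1, 1), (2, 1, 1)]
    fits = [(order, ARIMA(history, order=order).fit()) for order in candidates]
    best_order, _ = min(fits, key=lambda t: t[1].aic)

    # Online: refit on history plus recent measurements and forecast a few steps
    # ahead; the forecast values can then be run through the same correlation-based
    # check to decide whether an impending failure should be predicted.
    recent = np.random.rand(50)
    refit = ARIMA(np.concatenate([history, recent]), order=best_order).fit()
    print(refit.forecast(steps=10))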

We evaluate our approach in synthetic fault-injection experiments as well as two real-world cases: StationsStat and the Android OS. StationsStat is a multi-tier application used to check the availability of workstations in Purdue’s computing labs. The application suffered from an unknown bug that made it fail periodically by becoming unresponsive to end users. Augury predicted the majority of the failure cases with an average lookahead time of 51 minutes. Furthermore, Augury pinpointed the metrics most closely associated with the problem in a blind study in which we did not know the problem’s root cause. Subsequently, the application’s developer confirmed that the metrics Augury identified point to the likely root cause. In the Android OS case, multiple versions of Android with known bugs are used to evaluate Augury’s ability to detect and predict failures. Our fault-injection experiments are carried out using the RUBiS application.

Achieved Technical Goals

Synthetic fault injection: In these experiments, we were able to achieve high accuracy (both recall and precision). When compared to polynomial regression, our approach produces better accuracy overall (as seen in the chart below).

Performance Result: The amount of time it takes to execute all online steps in Augury is less than 10 msec on average. This includes selecting the best ARIMA models, performing forecasts, and calculating the nearest-neighbor distance of correlation-coefficient vectors. We varied the number of metrics used in the analysis and verified that the overall time grows almost quadratically, as expected, since the complexity of calculating correlations for n metrics is O(n²). The results show that it is possible to perform the entire online analysis for more than 800 metrics in less than a second.
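To illustrate why the cost grows quadratically, a small hypothetical timing loop over just the pairwise-correlation step (sample counts and metric counts are arbitrary, and these are not the measurements reported above):

    import time
    import numpy as np

    # The number of metric pairs grows as n*(n-1)/2, so computing all pairwise
    # correlations scales roughly quadratically with the number of metrics n.
    for n in (100, 200, 400, 800):
        window = np.random.rand(100, n)            # 100 samples of n metrics
        start = time.perf_counter()
        np.corrcoef(window, rowvar=False)
        print(f"{n} metrics: {(time.perf_counter() - start) * 1000:.1f} ms")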

Android Case Study: We use two previously documented bugs in the Android OS to evaluate Augury: a file-descriptor leak and an HTTPS request hang. In both cases, we developed a program that mimics a realistic workload and activates the bug. As seen in the two charts below, we were able to detect the error soon after the bug was activated.

StationsStat Case Study: We evaluate Augury on a real production multi-tier application that is used to check the availability of workstations on the Purdue campus. Due to an unknown bug in the application, periodic failures are observed in which the application becomes unresponsive. These failures become visible to system administrators through alerts from their monitoring system, Nagios, or through user phone calls reporting the problem. When a failure is observed, the application is restarted and the problem appears to go away temporarily. Our result is shown below. Notice how peaks in the distance occur when failures occur. Thresholds are marked with dashed lines. Failures are shown as critical alerts, warning alerts, and missing-data events.

Vrisha is a scheme designed to detect and localize scale-dependent bugs in parallel and distributed systems. Scale-dependent means that these bugs become visible only when the application runs on a system at large scale. This property makes detecting and localizing such bugs very difficult in the development and testing phase, where small-scale systems are used. It is a common scenario that such bugs are found only after the application is deployed on a customer’s multi-million-dollar supercomputer, wasting a great deal of time, money, and energy.

There are two challenges in designing a detection and localization scheme for scale-dependent bugs in the context of parallel and distributed computing systems. Suppose we have a bug report from the buggy application running on a large system at scale P. (1) We have no access to a bug-free run at scale P, since the bug could be triggered in any run on that system. (2) The bug is invisible on systems at scales smaller than P. To summarize, we can collect bug-free data from small-scale runs (where the bug is invisible), and data from large-scale runs that may be buggy or bug-free, without knowing which a priori.

We designed a special program-behavior modeling technique in Vrisha to address these two challenges. Conceptually, a program’s execution data is split into two categories: control features are those that entirely determine the observed behavior of a program, such as command-line arguments, the number of processors in the system, and the size of the input, while observational features describe the program behavior that we can collect at runtime, such as the number of messages sent at each distinct location in the program or the results of conditional branches in the execution. We build a Kernel Canonical Correlation Analysis (KCCA) model between the control and the observational features. With the help of the KCCA model, we can predict what the correct observational features should be in a large-scale run, based on the relationship between the control and observational features collected in small-scale runs. With this capability of behavior prediction, bug detection at scale reduces to comparing the predicted and actual values and flagging a bug candidate whenever the deviation is larger than a predefined threshold (calibrated through training with bug-free data so that the threshold itself does not cause false positives).
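A simplified sketch of this idea, using plain (linear) CCA from scikit-learn as a stand-in for the kernelized KCCA model Vrisha uses; all feature names, shapes, and thresholds below are purely illustrative:

    import numpy as np
    from sklearn.cross_decomposition import CCA

    # Training: control features (e.g., number of processes, input size) and
    # observational features (e.g., message counts per program location) are
    # collected from small-scale, bug-free runs.
    X_control = np.random.rand(50, 3)      # 50 small-scale runs, 3 control features
    Y_observed = np.random.rand(50, 10)    # 10 observational features per run

    cca = CCA(n_components=2).fit(X_control, Y_observed)

    def detect(control_large, observed_large, threshold):
        """Project a large-scale run's control and observational features into
        the shared canonical space; a large deviation between the two projections
        suggests the observed behavior does not match what the control features
        predict, flagging a bug candidate."""
        xc, yc = cca.transform(control_large.reshape(1, -1),
                               observed_large.reshape(1, -1))
        deviation = np.linalg.norm(xc - yc)
        return deviation > threshold

The threshold would be calibrated on bug-free training runs, in line with the description above, so that normal deviations do not trigger false positives.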

Students

  • Ignacio Laguna, ilaguna AT purdue DOT edu
  • Nawanol Theera-Ampornpunt, ntheeraa AT purdue DOT edu
  • Paul Rosswurm, prosswur AT purdue DOT edu
Last modified: March 18, 2015

Download Software