CRISP researcher presents result on dealing with fail-silent errors in programs at SRDS in Budapest

Event Date: September 26, 2016
CRISP researcher Tara Thomas will present her research on detecting fail-silent errors in large-scale software programs at the 35th Symposium on Reliable Distributed Systems (SRDS), to be held in Budapest, Hungary, September 26-29, 2016.

The SRDS conference highlights research in building reliable distributed systems. At this year's meeting, Tara will present results from research performed jointly with Anmol Bhattad, Subrata Mitra, and Prof. Saurabh Bagchi.

The size and complexity of supercomputing clusters are rapidly increasing to cater to the needs of complex scientific applications. At the same time, the feature size and operating voltage level of the internal components are decreasing. This dual trend makes these machines extremely vulnerable to soft errors, or random bit flips. For complex parallel applications, these soft errors can lead to silent data corruption, which in turn can cause large inaccuracies in the final computational results. Hence, it is important to determine the presence and severity of such errors early on, so that proper countermeasures can be taken. In this paper, we introduce a tool called Sirius, which can accurately identify silent data corruptions based on the simple insight that there exists spatial and temporal locality within most variables in such programs.

Sirius uses neural networks to learn such locality patterns, separately for each critical variable, and produces probabilistic assertions that can be embedded in the code of the parallel program to detect silent data corruptions. We have implemented this technique on two parallel benchmark programs, LULESH and CoMD. Our evaluations show that Sirius can detect silent errors in the code with much higher accuracy than previously proposed methods. Sirius detected 98% of the silent data corruptions with a false positive rate of less than 0.02, compared to the false positive rate of 0.06 incurred by the state-of-the-art acceleration-based prediction (ABP) technique.
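
To make the locality idea concrete, here is a minimal sketch in Python. It is not the Sirius implementation: in place of a neural network it uses a fixed neighbor-average predictor, and the only "learned" quantity is a tolerance calibrated on a fault-free run. The toy 1D diffusion simulation and all function names (`flip_bit`, `step`, `predict`, `calibrate`, `check`) are assumptions made for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def initial_field(size: int = 64) -> list:
    """Smooth starting profile for the toy simulation (hypothetical workload)."""
    return [math.sin(math.pi * i / (size - 1)) for i in range(size)]


def step(field: list) -> list:
    """One explicit diffusion step on a 1D field with fixed boundary cells."""
    new = field[:]
    for i in range(1, len(field) - 1):
        new[i] = field[i] + 0.25 * (field[i - 1] - 2 * field[i] + field[i + 1])
    return new


def predict(prev: list, cur: list, i: int) -> float:
    """Predict cur[i] from its temporal neighbor (prev[i]) and its
    spatial neighbors (cur[i-1], cur[i+1])."""
    return 0.5 * (prev[i] + 0.5 * (cur[i - 1] + cur[i + 1]))


def calibrate(n_steps: int = 50, size: int = 64, margin: float = 3.0) -> float:
    """Learn a tolerance from a fault-free run: margin * largest residual seen."""
    field = initial_field(size)
    worst = 0.0
    for _ in range(n_steps):
        nxt = step(field)
        for i in range(1, size - 1):
            worst = max(worst, abs(nxt[i] - predict(field, nxt, i)))
        field = nxt
    return margin * worst


def check(prev: list, cur: list, tol: float) -> list:
    """Return interior indices whose value violates the locality assertion."""
    return [
        i
        for i in range(1, len(cur) - 1)
        if abs(cur[i] - predict(prev, cur, i)) > tol
    ]


# Usage: inject a single high-order bit flip and run the check.
tol = calibrate()
prev = initial_field()
cur = step(prev)
corrupted = cur[:]
corrupted[20] = flip_bit(cur[20], 62)  # flip the top exponent bit
print("fault-free violations:", check(prev, cur, tol))
print("violations after bit flip:", check(prev, corrupted, tol))
```

In this sketch a corrupted cell also disturbs the predictions for its immediate neighbors, so adjacent indices may be flagged as well. A flip in a low-order mantissa bit changes the value by far less than the tolerance and goes undetected, which is acceptable when such a perturbation has a negligible effect on the final result.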