Fault Tolerance for Distributed Applications

2022

  1. NeurIPS
    “Root Cause Analysis of Failures in Microservices through Causal Discovery,”
    Azam Ikram; Sarthak Chakraborty, Subrata Mitra, Shiv Saini (Adobe Research); Saurabh Bagchi, and Murat Kocaoglu. At the 36th Conference on Neural Information Processing Systems (NeurIPS), pp. 31158-31170, November-December 2022. (Acceptance rate: 2,665/10,411 = 25.6%)
  2. CIKM
    “AutoForecast: Automatic Time-Series Forecasting Model Selection,”
    Mustafa Abdallah (Purdue); Ryan Rossi, Kanak Mahadik, Sungchul Kim, Handong Zhao, Haoliang Wang (Adobe Research); Saurabh Bagchi (Purdue). At the 31st ACM International Conference on Information and Knowledge Management (CIKM), pp. 1-10, October 2022. (Acceptance rate: 274/1175 = 23.3%) [ Dataset ]
  3. OSDI
    “ORION: Optimized Execution Latency for Serverless DAGs,”
    Ashraf Youssef Mahgoub, Edgardo Barsallo Yi; Karthick Shankar (Carnegie Mellon University); Somali Chaterji; Sameh Elnikety (Microsoft Research); Saurabh Bagchi. Accepted to appear at the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI’22), pp. 1–15, July 2022. (Acceptance rate: 49/253 = 19.4%)
  4. Sigmetrics
    “WISEFUSE: Workload Characterization and Optimized Execution Plans for Serverless DAG Workflows,”
    Ashraf Mahgoub, Edgardo Barsallo Yi; Karthick Shankar (Carnegie Mellon University); Eshaan Minocha, Somali Chaterji; Sameh Elnikety (Microsoft Research); Saurabh Bagchi. Accepted to appear at the 2022 ACM SIGMETRICS conference, pp. 1–24, June 2022. (Acceptance rate: 17/126 = 13.5% (Winter submission cycle))

2021

  1. Usenix ATC
    “SONIC: Application-aware Data Passing for Chained Serverless Applications,”
    Ashraf Mahgoub, Karthick Shankar (CMU), Subrata Mitra (Adobe Research), Ana Klimovic (ETH Zurich), Somali Chaterji, and Saurabh Bagchi. At the Usenix Annual Technical Conference (Usenix ATC), pp. 1-15, July 2021. (Acceptance rate: 64/341 = 18.8%) 

2020

  1. ISM
    “Closing-the-Loop: A Data-Driven Framework for Effective Video Summarization,”
    Ran Xu, Haoliang Wang (Adobe Research), Stefano Petrangeli (Adobe Research), Viswanathan Swaminathan (Adobe Research), and Saurabh Bagchi. At the 22nd IEEE International Symposium on Multimedia (ISM), pp. 1–8, Dec 2020. (Acceptance rate: 16/55 = 29.1%)
  2. Usenix ATC
    “OptimusCloud: Heterogeneous Configuration Optimization for Distributed Databases in the Cloud,”
    Ashraf Mahgoub, Alexander Michaelson Medoff, Rakesh Kumar (Microsoft), Subrata Mitra (Adobe Research), Ana Klimovic (Google Research), Somali Chaterji, and Saurabh Bagchi. At the Usenix Annual Technical Conference (Usenix ATC), pp. 189-204, July 2020. (Acceptance rate: 65/348 = 18.7%) [ Presentation ] [ Video ]
  3. DSN
    “The Mystery of the Failing Jobs: Insights from Operational Data from Two University-Wide Computing Systems,”
    Rakesh Kumar, Saurabh Jha (University of Illinois at Urbana-Champaign), Ashraf Mahgoub, Rajesh Kalyanam, Stephen L Harrell, Xiaohui Carol Song, Zbigniew Kalbarczyk (University of Illinois at Urbana-Champaign), William T Kramer (University of Illinois at Urbana-Champaign), Ravishankar K. Iyer (University of Illinois at Urbana-Champaign), and Saurabh Bagchi. At the 50th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 158–171, June-July 2020. (Acceptance rate: 48/291 = 16.5%) [ Presentation ] [ Video ]
  4. OJCS
    “Vision Paper: Grand Challenges in Resilience: Autonomous System Resilience through Design and Runtime Measures,”
    Saurabh Bagchi, Vaneet Aggarwal, Somali Chaterji, Fred Douglis, Aly El Gamal, Jiawei Han, Brian J. Henz, Hank Hoffmann, Suman Jana, Milind Kulkarni, Felix Xiaozhu Lin, Karen Marais, Prateek Mittal, Shaoshuai Mou, Xiaokang Qiu, and Gesualdo Scutari. In IEEE Open Journal of the Computer Society (OJCS), pp. 1-15, 2020, doi: 10.1109/OJCS.2020.3006807.

2019

  1. CoNLL
    “SIMVECS: Similarity-based Vectors for Utterance Representation in Conversational AI Systems,”
    Ashraf Mahgoub, Youssef Shahin (Microsoft), Riham Mansour (Microsoft), and Saurabh Bagchi. At the SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 1-10, Nov 3-4, 2019, Hong Kong. (Acceptance rate: 97/428 = 22.7%)
  2. Usenix ATC
    “SOPHIA: Online Reconfiguration of Clustered NoSQL Databases for Time-Varying Workloads,”
    Ashraf Mahgoub, Paul Wood, Alexander Medoff, Subrata Mitra (Adobe Research), Folker Meyer (Argonne National Lab), Somali Chaterji, and Saurabh Bagchi. At the 2019 USENIX Annual Technical Conference (Usenix ATC), pp. 223-240, Jul 10-12, 2019, Renton, WA. (Acceptance rate: 71/356 = 19.9%) [ Presentation ] [ Lightning talk ] [ YouTube video ]
  3. ICS
    “AMPT-GA: Automatic Mixed Precision Floating Point Tuning for GPU Applications,”
    Pradeep Kotipalli, Ranvijay Singh, Paul Wood, Ignacio Laguna (Lawrence Livermore National Lab), and Saurabh Bagchi. At the 33rd ACM International Conference on Supercomputing (ICS), pp. 160-170, Jun 26-28, 2019, Phoenix, AZ. (Acceptance rate: 45/193 = 23.3%) [ Presentation ] [ Slide show ]
  4. ISC
    “GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications,”
    Ignacio Laguna, Paul C. Wood, Ranvijay Singh, and Saurabh Bagchi. Accepted to appear at the International Supercomputing Conference (ISC), pp. 227-246, Jun 17-19, Frankfurt, Germany. (Acceptance rate: 17/72 = 23.6%) [ Hans Meuer Award winner (best paper) ] [ Presentation ]
  5. CACM
    “Dependability in Edge Computing,”
    Paul Wood, Heng Zhang, Muhammad-Bilal Siddiqui, Saurabh Bagchi. To appear in Communications of the ACM (CACM) as Contributed Article, pp. 1-16.
  6. “Smoothing the path to computing: pondering uses for big data,”
    M Hall, R Ladner, D Levitt, MAP Quiñones, S Bagchi. Communications of the ACM 62 (3), 8-9.
  7. “FRESCO: Open Source Data Repository for Computational Usage and Failures,”
    S Bagchi, R Kumar, R Kalyanam, S Harrell, CA Ellis, C Song. Repository documentation found here.

2018

  1. ICST
    “XSTRESSOR: Automatic Generation of Large-Scale Test Inputs by Inferring Path Conditions,”
    Charitha Saumya, Jinkyu Koo, Milind Kulkarni, and Saurabh Bagchi. Accepted to appear at the 12th IEEE International Conference on Software Testing, Verification, and Validation (ICST), pp. 1-11, Apr 22-27, 2019, Xi’an, China. (Acceptance rate: 31/110 = 28.2%) [ Distinguished Paper Award (one of 3) ]
  2. ICST
    “PySE: Automatic Worst-Case Test Generation by Reinforcement Learning,”
    Jinkyu Koo, Charitha Saumya, Milind Kulkarni, and Saurabh Bagchi. Accepted to appear at the 12th IEEE International Conference on Software Testing, Verification, and Validation (ICST), pp. 1-11, Apr 22-27, 2019, Xi’an, China. (Acceptance rate: 31/110 = 28.2%)
  3. Middleware
    “Pythia: Improving Datacenter Utilization via Precise Contention Prediction for Multiple Co-located Workloads,”
    Ran Xu (Purdue University); Subrata Mitra (Adobe Research); Jason Rahman (Facebook); Peter Bai (Purdue University); Bowen Zhou (LinkedIn); Greg Bronevetsky (Google); Saurabh Bagchi (Purdue University). At the 19th ACM/IFIP International Middleware Conference, pp. 146-160, December 10-14, 2018, Rennes, France. (Acceptance rate: 22/95 = 23.2%) [ Presentation ]
  4. USENIX ATC
    “VideoChef: Efficient Approximation for Streaming Video Processing Pipelines,”
    Ran Xu, Jinkyu Koo, Rakesh Kumar, Peter Bai; Subrata Mitra (Adobe Research); Sasa Misailovic (University of Illinois Urbana-Champaign); Saurabh Bagchi. At the 2018 USENIX Annual Technical Conference (USENIX ATC), pp. 43-56, July 11-13, 2018, Boston, MA. (Acceptance rate: 76/378 = 20.1%) [ Presentation ] [ Audio ]

2017

  1. ScalA
    “Snowpack: Efficient Parameter Choice for GPU Kernels via Static Analysis and Statistical Prediction,”
    Ranvijay Singh, Paul Wood, Ravi Gupta (Intel), Saurabh Bagchi, Ignacio Laguna (LLNL). At the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), co-located with the IEEE/ACM Supercomputing conference, pp. 1-8, November 13, 2017, Denver, Colorado. [ Presentation ]
  2. Middleware
    “Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads,”
    Ashraf Mahgoub, Paul Wood, Sachandhan Ganesh, Subrata Mitra (Adobe Research), Wolfgang Gerlach (Argonne National Laboratory), Travis Harrison (Argonne National Laboratory), Folker Meyer (Argonne National Laboratory), Ananth Grama, Saurabh Bagchi, and Somali Chaterji. At the ACM/IFIP/USENIX Middleware Conference, pp. 28-40, Dec 11-15, 2017, Las Vegas, Nevada. (Acceptance rate: 20/85 = 23.5%) [ Presentation ] [ Poster ]
  3. Briefings in Bioinformatics
    “Federation in Genomics Pipelines: Techniques and Challenges,”
    Somali Chaterji, Jinkyu Koo, Ninghui Li, Folker Meyer, Ananth Grama, and Saurabh Bagchi. In Oxford Briefings in Bioinformatics, pp. 1-11, Published: 29 August 2017. [ Abstract ]
  4. Briefings in Bioinformatics
    “MG-RAST Version 4—Lessons learned from a decade of low-budget ultra-high throughput metagenome analysis,”
    Folker Meyer, Saurabh Bagchi, Somali Chaterji, Wolfgang Gerlach, Ananth Grama, Travis Harrison, Tobias Paczian, Will Trimble, Andreas Wilke. In Oxford Briefings in Bioinformatics, bbx105, pp. 1-12, September 2017. [ Abstract ]
  5. ACM BCB
    “Scalable Genomic Assembly through Parallel de Bruijn Graph Construction for Multiple K-mers,”
    Kanak Mahadik, Christopher Wright, Milind Kulkarni, Saurabh Bagchi, Somali Chaterji. In Proceedings of the 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), pp. 425-431, Aug 20-23, 2017, Boston, MA. [ Presentation ]
  6. FTXS
    “Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters,”
    Ayush Patwari, Ignacio Laguna, Martin Schulz, and Saurabh Bagchi. At the 7th Fault Tolerance for HPC at eXtreme Scales (FTXS) Workshop (co-located with HPDC), pp. 1-6, Jun 26, 2017, Washington DC. [ Presentation ]

2016

  1. CGO
    “Phase-Aware Optimization in Approximate Computing,”
    Subrata Mitra, Manish Gupta, Sasa Misailovic (U of Illinois at Urbana-Champaign), Saurabh Bagchi. At the 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1-12, Feb 4-8, 2017, Austin, TX. (Acceptance rate: 26/114 = 22.8%) [ Presentation ]
  2. “A Study of Failures in Community Clusters: The Case of Conte,”
    Subrata Mitra, Suhas Raveesh Javagal, Amiya K. Maji (ITaP), Todd Gamblin (LLNL), Adam Moody (LLNL), Stephen Harrell (ITaP), and Saurabh Bagchi. At the 7th IEEE International Workshop on Program Debugging, co-located with ISSRE, pp. 1-8, Oct 23-27, 2016, Ottawa, Canada. [ Presentation ]
  3. SRDS
    “Sirius: Probabilistic data assertions for detecting silent data corruptions in parallel programs,”
    Tara Thomas, Anmol Bhattad, Subrata Mitra, and Saurabh Bagchi. At the IEEE 35th Symposium on Reliable Distributed Systems (SRDS), pp. 1-10, September 26-29, 2016, Budapest, Hungary. (Acceptance rate: 27/83 = 32.5%) [ Presentation ]
  4. ICS
    “SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications,”
    Kanak Mahadik, Christopher Wright, Jinyi Zhang, Milind Kulkarni, Saurabh Bagchi, and Somali Chaterji. At the International Conference on Supercomputing (ICS), pp. 1-13, June 1-3, 2016, Istanbul, Turkey (Acceptance rate: 43/178 = 24.2%).
  5. EuroSys
    “Partial-Parallel-Repair (PPR): A Distributed Technique for Repairing Erasure Coded Storage,”
    Subrata Mitra, Rajesh Krishna Panta (AT&T Labs), Moo-Ryong Ra (AT&T Labs), Saurabh Bagchi. At the European Conference on Computer Systems (EuroSys), pp. 1-14, April 18-21, 2016, London, UK (Acceptance rate: 38/180 = 21.1%). [ Presentation ]

2015

  1. PACT
    “Dealing with the Unknown: Resilience to Prediction Errors,”
    Subrata Mitra, Greg Bronevetsky, Suhas Javagal and Saurabh Bagchi. At the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 1-10, October 18-21, 2015, San Francisco, CA. (Acceptance rate: 38/179 = 21.2%) [ Presentation ]
  2. BCB
    “An Ensemble SVM Model for the Accurate Prediction of Non-Canonical MicroRNA Targets,”
    Asish Ghoshal, Ananth Grama, Saurabh Bagchi and Somali Chaterji. At the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics (BCB), pp. 403-412, September 9-12, 2015, Atlanta, GA. (Acceptance rate: 48/141 = 34%) (Winner of the best paper award)

2014

  1. Middleware
    “Mitigating Interference in Cloud Services by Middleware Reconfiguration,”
    Amiya Maji, Subrata Mitra, Bowen Zhou, Saurabh Bagchi and Akshat Verma (IBM Research). At the 15th ACM/IFIP/USENIX Middleware conference, pp. 1-12, Nov 16-21, 2014. (Acceptance rate: 27/144 = 18.8%) [ Presentation ]
  2. Supercomputing
    “Orion: Scaling Genomic Sequence Matching with Fine-Grained Parallelization,”
    Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. At the International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing), pp. 1-11, Nov 16-21, 2014. (Acceptance rate: 82/394 = 20.8%) [ Presentation ] [ Abstract ]
  3. ICAC
    “Is Your Web Server Suffering from Undue Stress due to Duplicate Requests?”
    Fahad A. Arshad, Amiya K. Maji, Sidharth Mudgal, and Saurabh Bagchi. As a Short Paper, At the 11th International Conference on Autonomic Computing (ICAC), pp. 105-111, June 18-20, 2014, Philadelphia, PA. (Acceptance rate: 12 (full papers) + 10 (short papers)/53 = 41.5%) [Presentation ] [ Abstract ]
  4. TPDS
    “Diagnosis of Performance Faults in Large Scale MPI Applications via Probabilistic Progress-Dependence Inference,”
    Ignacio Laguna (LLNL), Dong Ahn (LLNL), Bronis de Supinski (LLNL), Saurabh Bagchi, and Todd Gamblin (LLNL), Accepted to appear in IEEE Transactions on Parallel and Distributed Systems (TPDS), pp. 1-15, notification of acceptance: March 2014. [ Presentation ] [ Abstract ]
  5. PLDI
    “Accurate Application Progress Analysis for Large-Scale Parallel Debugging,”
    Subrata Mitra, Ignacio Laguna, Dong H. Ahn, Saurabh Bagchi, Martin Schulz, and Todd Gamblin. At the ACM International Symposium on Programming Language Design and Implementation (PLDI), pp. 193-203, Edinburgh, UK, June 9-11, 2014. (Acceptance rate: 52/287 = 18.1%) [ Abstract ] [Presentation ]

2013

  1. ISSRE
    “Characterizing Configuration Problems in Java EE Application Servers: An Empirical Study with GlassFish and JBoss,”
    Fahad A. Arshad, Rebecca J. Krause, and Saurabh Bagchi, At the 24th IEEE International Symposium on Software Reliability Engineering (ISSRE), pp. 1-10, Pasadena, CA, November 4-7, 2013. (Acceptance rate: 46/131 = 35.1%)[ Abstract ] [Presentation ]
  2. SRDS
    “Automatic Problem Localization in Distributed Applications via Multi-dimensional Metric Profiling,”
    Ignacio Laguna, Subrata Mitra, Fahad A. Arshad, Nawanol Theera-Ampornpunt, Zongyang Zhu, Saurabh Bagchi, Samuel P. Midkiff, Mike Kistler (IBM Research), and Ahmed Gheith (IBM Research), At the 32nd International Symposium on Reliable Distributed Systems (SRDS), pp. 121-132, Braga, Portugal, September 30-October 3, 2013. (Acceptance rate: 22/67 = 32.8%) [ Presentation ] [ Abstract ]
  3. HPDC
    “WuKong: Automatically Detecting and Localizing Bugs that Manifest at Large System Scales,”
    Bowen Zhou, Jonathan Too, Milind Kulkarni, and Saurabh Bagchi. At the 22nd ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), pp. 131-142, New York City, NY, June 17-21, 2013. (Acceptance rate: 20/131 = 15.3%) [ Presentation ] [Abstract ]

2012

  1. HotDep
    “ABHRANTA: Locating Bugs that Manifest at Large System Scales,”
    Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. At the 8th Workshop on Hot Topics in System Dependability (HotDep) (co-located with OSDI ’12), pp. 1-6, Hollywood, CA, October 7, 2012. (Acceptance rate: 10/24 = 41.7%) [ Presentation ] [ Abstract ]
  2. Supercomputing
    “mcrEngine: A Scalable Checkpointing System using Data-Aware Aggregation and Compression,”
    Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, and Rudolf Eigenmann. At the IEEE/ACM International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing), pp. 1-10, Salt Lake City, Utah, November 10-16, 2012. (Acceptance rate: 100/472 = 21.2%) (One of 8 papers that is a finalist for the best student paper) [ Presentation ][ Abstract ]
  3. PACT
    “Probabilistic Diagnosis of Performance Faults in Large Scale Parallel Applications,”
    Ignacio Laguna, Dong H. Ahn, Bronis R. de Supinski, Saurabh Bagchi, and Todd Gamblin. At the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 1-10, September 19-23, 2012, Minneapolis, MN. (Acceptance rate: 39/207 = 18.8%) [Presentation ] [ Abstract ]
  4. DSN
    “Automatic Fault Characterization via Abnormality-Enhanced Classification,”
    Greg Bronevetsky (LLNL), Ignacio Laguna, Saurabh Bagchi and Bronis R. de Supinski (LLNL). In the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1-12, Boston, MA, June 25-28, 2012. (Acceptance rate: 51/236 = 21.6%) [ Presentation ] [ Abstract ]
  5. DSN
    “A Study of Soft Error Consequences in Hard Disk Drives,”
    Timothy Tsai (Hitachi GST), Nawanol Theera-Ampornpunt and Saurabh Bagchi. In the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (Practical Experience Report), pp. 1-8, Boston, MA, June 25-28, 2012. (Acceptance rate: 51/236 = 21.6%) [ Presentation ] [ Abstract ]

2011

  1. “The NEEShub Cyberinfrastructure for Earthquake Engineering,”
    Thomas J. Hacker, Rudi Eigenmann, Saurabh Bagchi, Ayhan Irfanoglu, Santiago Pujol, Ann Catlin, Ellen Rathje. IEEE Computing in Science and Engineering, vol. 13, issue 4, pp. 67-78, July-August 2011.
  2. Supercomputing
    “Large Scale Debugging of Parallel Tasks with AutomaDeD,”
    Ignacio Laguna, Todd Gamblin, Bronis R. de Supinski, Saurabh Bagchi, Greg Bronevetsky, Dong H. Ahn, Martin Schulz, and Barry Rountree, At the Supercomputing Conference, 12 pages, Seattle, WA, Nov 12-18, 2011. (Acceptance rate: 74/352 = 21.0%) [ Presentation ] [ Abstract ]
  3. HPDC
    “Vrisha: Using Scaling Properties of Parallel Programs for Bug Detection and Localization,”
    Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi, At the 20th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 12 pages, San Jose, California, June 8-11, 2011. (Acceptance rate: 22/170 = 12.9%) [ Presentation ] [ Abstract ]

2010

  1. ISSRE
    “Characterizing Failures in Mobile OSes: A Case Study with Android and Symbian,”
    Amiya Kumar Maji, Kangli Hao, Salmin Sultana, and Saurabh Bagchi. At the 21st annual International Symposium on Software Reliability Engineering (ISSRE 2010), 10 pages, Nov 1-4, 2010, San Jose, California. (Acceptance rate: 40/130 = 30.8%) [ Abstract ]
  2. DSN
    “AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks,”
    Greg Bronevetsky, Ignacio Laguna, Saurabh Bagchi, Bronis R. de Supinski, Dong H. Ahn, and Martin Schulz. In the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 10 pages, June 28-July 1, 2010, Chicago, IL. (Acceptance rate (DCCS track): 40/174 = 23%) [Presentation ] [ Abstract ]

2009

  1. Middleware
    “How To Keep Your Head Above Water While Detecting Errors,”
    Ignacio Laguna, Fahad A. Arshad, David M. Grothe, and Saurabh Bagchi. In: ACM/IFIP/USENIX 10th International Middleware Conference, November 30-December 4, 2009, Urbana-Champaign, Illinois. (Acceptance rate: 21/110 = 19.1%) [ Presentation ] [ abstract ]
  2. Supercomputing
    “FALCON: A System for Reliable Checkpoint Recovery in Shared Grid Environments,”
    Tanzima Zerin, Saurabh Bagchi, and Rudolf Eigenmann. In: the ACM/IEEE Supercomputing Conference, November 14-20, 2009, Portland, Oregon. (Acceptance rate: 59/261 = 22.6%) (Nominated as one of 4 best student papers) [ Presentation ] [ abstract ]

2008

2007

  1. SRDS
    “Stateful Detection in High Throughput Distributed Systems,”
    Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 275-287, Beijing, China, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
  2. SRDS
    “Distributed Diagnosis of Failures in a Three Tier E-Commerce System,”
    Gunjan Khanna, Ignacio Laguna, Fahad A. Arshad, and Saurabh Bagchi. In: 26th IEEE International Symposium on Reliable Distributed Systems (SRDS-2007), pp. 185-198, Beijing, China, October 10-12, 2007. (Acceptance rate: 29/185 ~ 15.7%) [ Presentation ] [ abstract ]
  3. HPDC
    “Failure-Aware Checkpointing in Fine-Grained Cycle Sharing Systems,”
    Xiaojuan Ren, Rudolf Eigenmann, and Saurabh Bagchi. In: 16th IEEE International Symposium on High Performance Distributed Computing (HPDC-16), Monterey Bay, California, June 27-29, 2007. (Acceptance rate: 20%). [ Presentation ] [ abstract ]
  4. TDSC
    “Automated Rule-Based Diagnosis through a Distributed Monitor System,”
    Gunjan Khanna, Mike Yu Cheng, Padma Varadharajan, Saurabh Bagchi, Miguel P. Correia, and Paulo J. Verissimo. In: IEEE Transactions on Dependable and Secure Computing (TDSC), notification of acceptance: May 2007. [ abstract ]
  5. JOGC
    “Prediction of Resource Availability in Fine-Grained Cycle Sharing Systems and Empirical Evaluation,”
    Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. In Springer’s Journal of Grid Computing (JOGC), vol. 5, no. 2, pp. 173-195, 2007. [ abstract ]

2006

  1. ICCD
    “Pesticide: Using SMT Processors to Improve Performance of Pointer Bug Detection,”
    Jin-Yi Wang, Yen-Shiang Shue, T N Vijaykumar, and Saurabh Bagchi. 24th International Conference of Computer Design (ICCD), Oct 1-4, 2006, San Jose, California, USA.
  2. DSN
    “Providing Automated Detection of Problems in Virtualized Servers using Monitor framework,”
    Gunjan Khanna, Saurabh Bagchi, Kirk Beaty, Andrzej Kochut, and Gautam Kar. Workshop on Applied Software Reliability (WASR) at the International Conference on Dependable Systems and Networks (DSN), June 25-28, 2006, Philadelphia, Pennsylvania, USA. [ Presentation]
  3. HPDC
    “Resource Failure Prediction in Fine-Grained Cycle Sharing Systems,”
    Xiaojuan Ren, Seyong Lee, Rudolf Eigenmann, and Saurabh Bagchi. 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), 19-23 June 2006, Paris, France. (Acceptance rate: 24/157 ~ 15%). [ Presentation ]
  4. TDSC
    “Automated Online Monitoring of Distributed Applications through External Monitors,”
    Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 3, no. 2, pp. 115-129, Apr-Jun, 2006.

2005

  1. “Probabilistic Diagnosis through Non-Intrusive Monitoring in Distributed Applications,”
    Gunjan Khanna, Yu Cheng, Saurabh Bagchi, Miguel Correia, and Paolo Verissimo. Purdue ECE Technical Report 05-19, December 2005.
  2. SRDS
    “LRRM: A Randomized Reliable Multicast Protocol for Optimizing Recovery Latency and Buffer Utilization,”
    Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi. 24th IEEE Symposium on Reliable Distributed Systems (SRDS 2005), October 26-28, 2005, Orlando, Florida, USA. (Acceptance rate: 20/67 ~ 29.9%) [ Camera ready ]
  3. “Automated Monitor Based Diagnosis in Distributed Systems,”
    Gunjan Khanna, Padma Varadharajan, Mike Cheng, and Saurabh Bagchi, Purdue ECE Technical Report 05-13, August 2005.

2004

  1. SRDS
    “Self Checking Network Protocols: A Monitor Based Approach,”
    Gunjan Khanna, Padma Varadharajan, and Saurabh Bagchi. 23rd International Symposium on Reliable Distributed Systems (SRDS 2004), October 2004. (Acceptance rate: 27/117 ~ 23.1%) [ Camera Ready ] [ Presentation ]
  2. PRDC
    “Failure Handling in a Reliable Multicast Protocol for Improving Buffer Utilization and Accommodating Heterogeneous Receivers,”
    Gunjan Khanna, John Rogers, and Saurabh Bagchi. In Proceedings of the 10th IEEE Pacific Rim Dependable Computing Conference (PRDC’ 04), March 2004. (Acceptance rate: 34/102 ~ 33.3%) [ Camera ready ]

2003

  1. “Self-Checking Network Protocols: A Monitor Based Approach,”
    Gunjan Khanna, MS Thesis. December 2003.
  2. “Light-Weight Randomized Reliable Multicasting Protocol,”
    Nipoon Malhotra, Shrish Ranjan, and Saurabh Bagchi. Appeared in Fast Abstracts, DSN2003.
Copyright notice: Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional
purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work
in other works must be obtained from the appropriate publisher (IEEE, ACM, Elsevier, etc.)

Last modified: April 2, 2023