Software & Datasets

This page lists our open-source software and open datasets. We believe strongly in the open sharing of software and datasets, and in particular in the freedom that copyleft licenses provide. All material here is licensed under the GNU General Public License (GPL) unless otherwise specified. If you use a dataset or software package, please cite the paper listed beneath it.

  1. Time series model selection: A collection of univariate and multivariate testbeds and a suite of time series forecasting algorithms benchmarked on them. We use this corpus to evaluate our technique for fast, automatic selection of the best forecasting model for a new, unseen time series dataset, without first having to train (or evaluate) all the models on the new data. The corpus contains 308 univariate and 40 multivariate time series datasets and 6 prediction techniques. Notably, it includes an Adobe trace dataset that records CPU and memory usage for 50 different services running in Adobe production clusters, collected over 15 days from May 1–15, 2021.

    “AutoForecast: Automatic Time-Series Forecasting Model Selection,” Mustafa Abdallah (Purdue); Ryan Rossi, Kanak Mahadik, Sungchul Kim, Handong Zhao, Haoliang Wang (Adobe Research); Saurabh Bagchi (Purdue). To appear at the 31st ACM International Conference on Information and Knowledge Management (CIKM), pp. 1-10, October 2022.

  2. Mobility prediction for crowdsensing: Our EWSN 2020 paper on predicting the mobility of users (students on campus) enrolled in a crowdsensing campaign. The campaign ran over one month with 50 users to evaluate our solution, called CrowdBind, plus 4 competing solutions. The students went about their daily routines with their smartphones running the five software packages, which collected sensor data (e.g., pressure). This is the anonymized mobility trace of these 50 students over a 100 sq. km area.

    “CrowdBind: Fairness Enhanced Late Binding Task Scheduling in Mobile Crowdsensing,” Heng Zhang, Michael A. Roth (Google), Rajesh K. Panta (AT&T Labs Research), He Wang, and Saurabh Bagchi. At the 17th International Conference on Embedded Wireless Systems and Networks (EWSN), pp. 1-12, Feb 17-19, 2020, Lyon, France. (Best paper award winner)

  3. Bluetooth proximity tracing: Our CPSIoTSec 2020 paper on Bluetooth proximity data. We conducted a measurement study and collected traces of Bluetooth advertisements from 49 students on the Purdue campus over a period of two weeks in Feb-Mar 2019. Participants were asked to install an Android app that we wrote, which periodically collected and uploaded traces of Bluetooth advertisements and device locations.

    “Privacy in the Mobile World: An Analysis of Bluetooth Scan Traces,” Heng Zhang, Amiya K. Maji, and Saurabh Bagchi. At the 2020 Joint Workshop on CPS&IoT Security and Privacy (CPSIoTSec), co-located with the ACM Conference on Computer and Communications Security (CCS), pp. 1-5, November 9, 2020.

  4. Qui-Gon Jinn: Our DSN 2018 paper on the reliability of Wear OS. This software fuzzes apps, leading to crashes and hangs of the apps and even reboots of the smartwatch.

    “How Reliable is my Wearable: A Fuzz Testing-based Study,” Edgardo Barsallo Yi, Amiya K. Maji, and Saurabh Bagchi. At the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 410-417, June 25-28, 2018, Luxembourg City, Luxembourg. (Acceptance rate: 62/221 = 28.1%)

  5. ACES: Our USENIX Security 2018 paper on security in bare-metal embedded systems. It creates compartments out of off-the-shelf embedded software. The distribution targets the ARM Cortex-M4 class of devices.

    “ACES: Automatic Compartments for Embedded Systems,” Abraham A. Clements, Naif Saleh Almakhdhub, Saurabh Bagchi, and Mathias Payer. At the 27th USENIX Security Symposium (USENIX Sec), pp. 65-82, August 15-17, 2018, Baltimore, MD. (Acceptance rate: 100/524 = 19.1%)

  6. TATHYA: The dataset that accompanies our CIKM 2017 paper on automated fact checking. It contains manually annotated statements from the US presidential debates for the 2016 election. [ README ] [ Full thesis ]

    “A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates,” Ayush Patwari, Dan Goldwasser, and Saurabh Bagchi. At the 26th ACM International Conference on Information and Knowledge Management (CIKM) (Short paper), pp. 2259-2262, Nov 6-10, 2017, Singapore. (Acceptance rate: 119/398 = 29.9% (short papers))

  7. Rafiki: Our Middleware 2017 paper showing how to find optimized parameter settings for a NoSQL database (Cassandra in our case) when the workload characteristics change.

    “Rafiki: A Middleware for Parameter Tuning of NoSQL Datastores for Dynamic Metagenomics Workloads,” Ashraf Mahgoub, Paul Wood, Sachandhan Ganesh, Subrata Mitra (Adobe Research), Wolfgang Gerlach (Argonne National Laboratory), Travis Harrison (Argonne National Laboratory), Folker Meyer (Argonne National Laboratory), Ananth Grama, Saurabh Bagchi, and Somali Chaterji. At the ACM/IFIP/USENIX Middleware Conference, pp. 28-40, Dec 11-15, 2017, Las Vegas, Nevada. (Acceptance rate: 20/85 = 23.5%)

  8. FRESCO: The open-source data repository of system usage and failure information for Purdue’s centralized computing clusters. It contains anonymized data from 3+ million jobs in 2015-2017.

    “A Study of Failures in Community Clusters: The Case of Conte,” Subrata Mitra, Suhas Raveesh Javagal, Amiya K. Maji (ITaP), Todd Gamblin (LLNL), Adam Moody (LLNL), Stephen Harrell (ITaP), and Saurabh Bagchi. At the 7th IEEE International Workshop on Program Debugging, co-located with ISSRE, pp. 1-8, Oct 23-27, 2016, Ottawa, Canada.

  9. EPOXY: Our Security and Privacy 2017 paper on security in bare-metal embedded systems. It executes most code at the unprivileged level, except for a small number of critical regions. The distribution is for the ARM Cortex-M4 class of devices.

    “Protecting Bare-metal Embedded Systems with Privilege Overlays,” Abraham A. Clements, Naif Saleh Almakhdhub, Khaled Saab (Georgia Tech), Prashast Srivastava, Jinkyu Koo, Saurabh Bagchi, and Mathias Payer. In Proceedings of the IEEE International Symposium on Security and Privacy (Oakland/S&P), pp. 289-303, May 22-24, 2017, San Jose, California. (Acceptance rate: 60/450 = 13.3%)

  10. ScalaDBG: Our BCB 2017 paper on distributed genomic assembly. It builds de Bruijn graphs with different k values in parallel and then merges them. The distribution is built on top of the IDBA assembly algorithm.

    “Scalable Genomic Assembly through Parallel de Bruijn Graph Construction for Multiple K-mers,” Kanak Mahadik, Christopher Wright, Milind Kulkarni, Saurabh Bagchi, and Somali Chaterji. In Proceedings of the 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB), pp. 425-431, Aug 20-23, 2017, Boston, MA.

Last modified: September 26, 2022
