moreBugs


Downloading the Dataset

The entire dataset is 2 GB in size.

The dataset is free to use.

However, for our record-keeping purposes, we request that you fill out a license agreement form and send it to Shivani Rao before gaining access to the dataset. Upon receipt of this form, you will immediately be assigned a username and password for the dataset, which you can subsequently download from here.

If you would just like to see a small sample of the dataset, you can download it from here without any restrictions. The sample dataset is 18.6 MB in size.


What is moreBugs?

One of the ongoing research areas in this laboratory is "Retrieval from Software Libraries for Bug Localization." Bug localization means locating the source-code artifacts that may be responsible for the abnormal behavior of a program as reported in a given bug and its related information. Retrieval algorithms, when applied to bug localization, cast it as a search task: the software library is treated as a database of documents that is searched with a bug's textual description as the query, with the expectation that the files responsible for the abnormal behavior of the software will be retrieved. In order to evaluate such approaches to bug localization, one needs a ground-truth dataset that contains a list of bugs and, for each bug, (i) a textual description of the bug, (ii) a patch-list indicating the set of files that were modified to fix it, and (iii) the underlying software repository. This web page presents a dataset that contains this information.
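To make the search-task formulation concrete, here is a minimal sketch (ours, not part of moreBugs) of TF-IDF retrieval in Python, in which each source file is a document and the bug's textual description is the query; the file names and contents below are made up for illustration:

```python
import math
from collections import Counter

def tfidf_rank(bug_report, files):
    """Rank source files by cosine similarity between TF-IDF vectors.
    files: dict mapping file name -> (preprocessed) text."""
    def tokens(text):
        return [t.lower() for t in text.split()]

    docs = {name: Counter(tokens(text)) for name, text in files.items()}
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for counts in docs.values():
        df.update(counts.keys())

    def tfidf(counts):
        return {t: c * math.log(n / df[t]) for t, c in counts.items() if t in df}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = tfidf(Counter(tokens(bug_report)))
    vectors = {name: tfidf(c) for name, c in docs.items()}
    return sorted(vectors, key=lambda name: cosine(query, vectors[name]),
                  reverse=True)
```

A real system would replace the whitespace tokenizer with the identifier-splitting preprocessing described later on this page, but the retrieval skeleton is the same.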

Note that moreBugs is a superset of iBugs with regard to information-retrieval-based algorithms for mining software repositories. The iBugs dataset has been used in a number of investigations dealing with bug localization and prediction. Whereas iBugs is based on the information gleaned from bug-tracking systems (along with version histories at the revision points that correspond to bug fixes), moreBugs contains all of the version histories associated with a software library. This makes moreBugs useful for benchmarking the more modern approaches to impact analysis that link the bug-proneness of a library at any given point in time to the revision histories of the files over all of the past revisions. We expect moreBugs to also be valuable in the benchmarking of algorithms for change detection, impact analysis, the study of software-vocabulary evolution, and so on.

What does moreBugs contain?

The moreBugs dataset was derived from the AspectJ and JodaTime repositories. Here is a summary of what is in moreBugs:

                                                  AspectJ             JodaTime
Version control system                            Git                 Git
Number of tags/releases                           77                  32
Number of revisions                               7477                1537
Total duration of the project analyzed            Dec '02 - Feb '12   Dec '03 - June '12
Bug tracking system                               Bugzilla            SourceForge
Number of bugs mined from the VCS                 450                 57
Number of bugs found in the bug tracking system   350                 45


For each of the software libraries, moreBugs contains the following:

Why should You Use moreBugs?

If your goal is to use a source-code dataset for research on basic IR-based retrieval algorithms, you'll find moreBugs to be conceptually similar to its predecessor iBugs. The main reason you may wish to use moreBugs instead of iBugs is that moreBugs gives you preprocessed text files corresponding to the source-code files, whereas iBugs gives you just the raw source-code files. By preprocessing we mean eliminating numeric strings, Unicode characters, and special characters; breaking camel-case strings into their components; splitting identifiers that consist of strings joined by underscores; and so on. (With both iBugs and moreBugs, you get a pre-fix snapshot of the source-code library for each bug.) Additionally, keep in mind that moreBugs is a larger dataset, with 10 years of history built into it.
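As a rough illustration of the kind of preprocessing just described, here is a sketch in Python (ours, not the exact moreBugs pipeline) that drops non-identifier characters, splits underscore-joined and camel-case identifiers, and eliminates purely numeric strings:

```python
import re

def preprocess(source_text):
    """Tokenize source code: keep only identifier-like strings, split
    snake_case and camelCase, and discard purely numeric tokens."""
    terms = []
    for word in re.findall(r"[A-Za-z0-9_]+", source_text):
        for part in word.split("_"):                  # split snake_case
            # Split camelCase/PascalCase, keeping acronym runs (HTTP) together.
            for piece in re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part):
                if not piece.isdigit():               # eliminate numeric strings
                    terms.append(piece.lower())
    return terms
```

For example, `getHTTPResponse_code2` would come out as the terms `get`, `http`, `response`, and `code`.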

If you are engaged in research on incremental algorithms for IR-based retrieval (as, for example, described in the publications Rao2013a and Rao2013b), you will find the moreBugs dataset indispensable. The reason is the relationship between the version at which a bug is first reported and the version at which a developer attempts to fix it. Assume for a moment that a bug was reported in Version 3.1 of a repository and you are trying to fix it when the repository is at Version 3.9. The very first thing you would do is try to reproduce the fault in Version 3.9. If you cannot reproduce the fault (because of the modifications to the software in the intervening versions), you would simply declare the bug irreproducible and therefore closed. However, if you can reproduce the fault, the question now is which version of the source files you should examine to fix the bug. If your job is just to fix the bug, you'd obviously work on Version 3.9 of the files. That implies that the patch files you create will be based on Version 3.9. Consequently, the ground-truth data for the bug you just fixed would need to be based on Version 3.9. In other words, the model of the repository for this particular bug would need to be based on Version 3.9 and NOT on Version 3.1. In this example scenario, while it is true that iBugs would give you access to Version 3.9 of the repository, moreBugs would also give you access to the state of the repository through each commit up to Version 3.9. So if you created an IR model for the repository when it was at Version 3.1, you could carry the model forward to Version 3.9 through incremental updates over all of the intervening commits, and you'd do so with minimal additional computation.

With regard to the relationship between a bug report and the repository revision in which the bug was fixed, it is important to note that this information cannot always be extracted from the bug report itself. This is illustrated by the bug report shown in Figure 5 of our tech report that is mentioned below. You'll notice in that bug report that the field "target milestone" that should show the revision number in which the bug was fixed is empty. However, this information can be extracted from the commit history as explained in Section IV.A of the same tech report.

Note that evaluating IR algorithms is, in general, time-consuming because of the time it takes to parse the source files, to create an index, and to then learn the model. With the moreBugs dataset, you are saved the time that would otherwise go into these steps at each commit of a repository.

A Serious Issue with Some Current Publications on IR-Based Approaches to Automatic Bug Localization

Some papers that have been published recently in prestigious venues suffer from a serious shortcoming that can be explained in the following manner: Suppose you want to report the performance of your IR based bug localization algorithm on a set of 100 bugs. Just for the sake of making a theoretical point, assume that these 100 bugs belong to 100 different revisions of a repository, with some of these revisions corresponding to the official release versions and the others to regular commits. For a rigorous evaluation of bug localization performance, you should be constructing 100 different models of the repository, one model for each bug in your set of 100 bugs. But this is not what several researchers have done in their publications. These researchers create a single model from just one of the release versions of the repository and then compute the bug localization performance for all 100 bugs with respect to just that model. This obviously makes it easier to carry out the evaluation. However, on account of file deletions, additions, and modifications between the different revisions, any performance results that are reported in this manner cannot be fully trusted.
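The rigorous protocol described above amounts to the following evaluation loop, shown here as a sketch in which `build_model` and `rank_files` stand in for whatever retrieval machinery you use; the point is simply that the model is built per bug, for that bug's own revision:

```python
def evaluate_per_bug(bugs, build_model, rank_files, k=10):
    """Hit rate at rank k when each bug is scored against a model of the
    repository at that bug's own pre-fix revision -- never a single model
    shared across all bugs."""
    hits = 0
    for bug in bugs:
        model = build_model(bug["revision"])        # one model per revision
        ranked = rank_files(model, bug["description"])
        if any(f in bug["patch_files"] for f in ranked[:k]):
            hits += 1
    return hits / len(bugs)
```

With incremental updating (as sketched earlier), `build_model` need not rebuild from scratch for each of the, say, 100 revisions.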

The moreBugs dataset makes it easier for you to create and maintain multiple models for such performance evaluation scenarios.

What Text Preprocessing Steps are Not Included in the Preprocessed Files Provided by moreBugs?

We have not included stemming and stop-word removal in our preprocessing because these are active research areas in their own right. There are a number of choices available today for stemming; see, for example, the following publication: On the Use of Stemming for Concern Location and Bug Localization in Java. Similarly, the stop words tend to be domain and application specific. We thought it would be best if we gave the researchers the option to use their own favorite stemming algorithm and customized stop-word list.
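To show where such choices would plug in, here is a deliberately naive sketch (the stop list is hypothetical and the "stemmer" is a toy suffix stripper, a placeholder for a real algorithm such as Porter's, not an implementation of one):

```python
STOP_WORDS = {"the", "a", "is", "of", "in", "to"}   # hypothetical, domain-specific

def light_stem(term):
    """Toy suffix stripper -- stands in for your favorite stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def postprocess(terms):
    """Apply a custom stop list and stemmer to moreBugs' preprocessed terms."""
    return [light_stem(t) for t in terms if t not in STOP_WORDS]
```

Because moreBugs stops short of these steps, you can swap in any stemmer and any stop list without first undoing ours.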

Downloading the Technical Report

The details of how moreBugs was created, along with specific instructions on using it for your purposes, are available in the Technical Report.

How to Cite moreBugs

@techreport{moreBugs,
     title       = {{moreBugs: A New Dataset for Benchmarking
                     Algorithms for Information Retrieval from
                     Software Repositories} (TR-ECE-13-07)},
     author      = {Shivani Rao and Avinash Kak},
     institution = {Purdue University, School of Electrical
                    and Computer Engineering},
     year        = {2013},
     month       = {4},
}

Extracting Original Source Files

In both the sample and the full dataset, you will find that the original source files are missing. This was done to avoid duplication and save space. However, if you need the original source files, we provide a script with which you can extract the source files corresponding to a revision, bug, or tag. In order to run these scripts you will need to install R.

Step 1: Clone the repository into your home directory:

          /home/shivani$ git clone https://github.com/eclipse/org.aspectj.git

          This will create a local Git repository, with the full revision history, at

          /home/shivani/org.aspectj/.git

Step 2: Next, use this R script to extract the original source files.
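The R script is what ships with moreBugs; as a rough stand-in for readers who prefer Python, the extraction step reduces to checking out the working tree at the revision, tag, or bug-fix commit of interest (the path and ref below are placeholders, not values from the dataset):

```python
import subprocess

def extract_snapshot(repo_dir, ref):
    """Force-check-out the working tree of an already-cloned repository
    at the given revision SHA, tag, or branch name."""
    subprocess.run(["git", "-C", repo_dir, "checkout", "--force", ref],
                   check=True, capture_output=True)

# e.g. extract_snapshot("/home/shivani/org.aspectj", "<tag-or-SHA>")
```

After the call, the files on disk reflect the repository exactly as it stood at that revision, which is what the bug-specific models discussed above require.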

People

Support

If you have any difficulties, please email your questions and concerns to Shivani Rao.