However, for our record-keeping purposes, we request that you fill out
a License agreement form and send it to Shivani Rao before gaining
access to the dataset. Upon receipt of this form, you will immediately
be assigned a username and a password for the dataset, which you can
subsequently download from here.
If you would just like to see a small sample of the dataset,
you can download it from
here without
any restrictions. The size of the sample dataset is 18.6MB.
One of the ongoing research areas in this laboratory is
"Retrieval from Software libraries for Bug Localization."
Bug localization means locating the source code artifacts that may be
responsible for the abnormal behavior of a program as reported in a
given bug and its related information. Retrieval algorithms, when
applied to bug localization, cast it as a search task: the software
library is treated as a database of documents that is searched with a
bug's textual description as a query, with the expectation that the
relevant files responsible for the abnormal behavior of the software
will be retrieved. In order to
evaluate such approaches to bug localization, one needs a
ground-truth dataset that contains a list of bugs and, for each bug,
(i) a textual description of the bug, (ii) a patch-list indicating the
set of files that were modified, and (iii) the underlying software
repository. This web page presents a dataset that contains this
information.
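As a concrete illustration, one bug's ground-truth entry could be represented along the following lines. The field names and values below are hypothetical stand-ins, not the actual moreBugs schema:

```python
# A hypothetical ground-truth record for one bug, combining the three
# ingredients listed above (illustrative only, not the moreBugs format):
bug = {
    "id": 12345,  # bug-tracker identifier
    "description": "NullPointerException when parsing an empty file",
    # patch-list: the files a developer modified to fix the bug
    "patch_list": ["src/Parser.java", "src/Lexer.java"],
    # pointer into the underlying repository: state before the fix
    "pre_fix_revision": "r4182",
}
```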
Note that moreBugs is a superset
of iBugs
with regard to information retrieval based algorithms for
mining software repositories. The iBugs dataset has been
used in a number of investigations dealing with bug
localization and prediction. Whereas iBugs is based on the
information gleaned from bug tracking systems (along with
version histories at the revision points that correspond to
bug-fixes), moreBugs contains all of the version histories
associated with a software library. This makes moreBugs
useful for benchmarking the more modern approaches to impact
analysis that link bug-proneness of a library at any given
point in time to the revision histories of the files over
all of the past revisions. We expect moreBugs to also be
valuable for benchmarking algorithms for change detection, impact
analysis, the study of software vocabulary evolution, and so on.
If your goal is to use a source-code dataset for research in
basic IR based retrieval algorithms, you'll find moreBugs to
be conceptually similar to its predecessor iBugs. The main
reason you may wish to use moreBugs as opposed to iBugs
would be that whereas moreBugs gives you preprocessed text
files corresponding to the source code files, iBugs gives
you just the raw source code files. By preprocessing we mean
eliminating numeric strings, Unicode characters, and special
characters; breaking camel case strings into their
components; splitting the identifiers when they consist of
strings joined by underscores; and so on. (With both iBugs
and moreBugs, you get a pre-fix snapshot of the source code
library for each bug.) Additionally, do keep in mind that
moreBugs is a larger dataset with 10 years of history built
into it.
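To make the preprocessing concrete, here is a minimal Python sketch of the kinds of steps described above: dropping numeric strings, Unicode and special characters, splitting identifiers on underscores, and breaking camel case into components. It is an illustration of the idea, not the actual moreBugs preprocessing script:

```python
import re

def preprocess(source_text):
    """Approximate the preprocessing steps described above
    (illustrative sketch, not the actual moreBugs script)."""
    # Treat anything that is not an ASCII letter, digit, or underscore
    # (Unicode characters, special characters) as a separator.
    tokens = re.split(r"[^A-Za-z0-9_]+", source_text)
    words = []
    for tok in tokens:
        # Split identifiers joined by underscores, then break camel
        # case into its components (e.g. getHTTPResponse -> get HTTP Response).
        for part in tok.split("_"):
            words.extend(
                re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", part)
            )
    # Eliminate purely numeric strings and lowercase what remains.
    return [w.lower() for w in words if not w.isdigit()]

print(preprocess("getHTTPResponse_code42 //*"))
# → ['get', 'http', 'response', 'code']
```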
If you are engaged in research in incremental
algorithms for IR based retrieval (as for example, described in the publications Rao2013a and Rao2013b)
you will find the moreBugs dataset to be indispensable. The reason for that is
the relationship between the version number when a bug is
first reported and the version number when a developer
attempts to fix the bug. Assume for a moment that a bug was
reported in Version 3.1 of a repository and you are trying
to fix it when the repository is in Version 3.9. The very
first thing you would do would be to try reproducing the
fault in Version 3.9. If you cannot reproduce the fault
(because of the modification to the software in the
intervening versions), you would simply declare the bug
irreproducible and therefore closed. However, if you can
reproduce the fault, the question now is which version of the source
files you should examine to fix the bug. If your
job is to just fix the bug, you'd obviously work on Version
3.9 of the files. What that implies is that the patch files
you create will be based on Version 3.9. Consequently, the
ground truth data for the bug you just fixed would need to
be based on Version 3.9. In other words, the model of
the repository for this particular bug would need to be
based on Version 3.9 and NOT on Version 3.1. In this
example scenario, while it is true that iBugs would give you
access to Version 3.9 of the repository, moreBugs would also
give you access to the state of the repository through each
commit up to Version 3.9. So if you created an IR model for the
repository when it was at Version 3.1, you may be able to carry that
model forward to Version 3.9 by incrementally updating it through all
of the intervening commits, with minimal additional computation.
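The idea of carrying a model forward commit by commit can be sketched with a toy term-count index that is updated in place rather than rebuilt from scratch at each revision. The `build_index` and `apply_commit` helpers, and the added/removed diff format, are hypothetical illustrations, not part of moreBugs:

```python
from collections import Counter

def build_index(files):
    """Build a full term-count index from scratch.
    files: {path: list of terms} (hypothetical input format)."""
    return {path: Counter(terms) for path, terms in files.items()}

def apply_commit(index, added=None, removed=None):
    """Carry an existing index forward through one commit by touching
    only the files that changed, instead of re-indexing everything."""
    for path in removed or []:
        index.pop(path, None)
    for path, terms in (added or {}).items():
        index[path] = Counter(terms)  # covers both added and modified files
    return index

# Model built when the repository was at Version 3.1 ...
index = build_index({"Foo.java": ["bug", "fix"], "Bar.java": ["parse"]})
# ... then carried forward, one commit at a time, toward Version 3.9:
apply_commit(index, added={"Baz.java": ["render"]}, removed=["Bar.java"])
```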
With regard to the relationship between a bug report and the
repository revision in which the bug was fixed, it is
important to note that this information cannot always be
extracted from the bug report itself. This is illustrated by
the bug report shown in Figure 5 of our tech report that is
mentioned below. You'll notice in that bug report that the "target
milestone" field, which should show the revision number in which the
bug was fixed, is empty. However, this
information can be extracted from the commit history as
explained in Section IV.A of the same tech report.
Note that evaluating IR algorithms, in general, is time
consuming because of the time it takes to parse the source
files, to create an index, and to then learn the model. With
the moreBugs dataset, you are saved the time that would
otherwise go into these steps at each commit for a
repository.
Some papers that have been published recently in prestigious
venues suffer from a serious shortcoming that can be
explained in the following manner: Suppose you want to
report the performance of your IR based bug localization
algorithm on a set of 100 bugs. Just for the sake of making
a theoretical point, assume that these 100 bugs belong to
100 different revisions of a repository, with some of these
revisions corresponding to the official release versions and
the others to regular commits. For a rigorous evaluation of
bug localization performance, you should be constructing 100
different models of the repository, one model for each bug
in your set of 100 bugs. But this is not what several
researchers have done in their publications. These
researchers create a single model from just one of the
release versions of the repository and then compute the bug
localization performance for all 100 bugs with respect to
just that model. This obviously makes it easier to carry out
the evaluation. However, on account of file deletions,
additions, and modifications between the different
revisions, any performance results that are reported in this
manner cannot be fully trusted.
The moreBugs dataset makes it easier for you to create and
maintain multiple models for such performance evaluation
scenarios.
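The rigorous protocol described above can be sketched as follows: a fresh model is built for each bug from the repository state at that bug's own revision, rather than from a single release version. The `ToyModel` class, the `checkout` argument, and the bug-record fields are all hypothetical stand-ins for a real retrieval model and dataset loader:

```python
class ToyModel:
    """A toy retrieval model: rank files by word overlap with the query."""
    def __init__(self, files):  # files: {path: file text}
        self.files = {p: set(text.lower().split()) for p, text in files.items()}

    def query(self, description, top_k=1):
        q = set(description.lower().split())
        ranked = sorted(self.files, key=lambda p: -len(self.files[p] & q))
        return ranked[:top_k]

def evaluate(bugs, checkout):
    """Fraction of bugs for which a relevant file is retrieved.
    One model per bug, built from the repository state at that bug's
    own revision (checkout is a hypothetical revision loader)."""
    hits = 0
    for bug in bugs:
        model = ToyModel(checkout(bug["revision"]))  # one model per bug
        retrieved = model.query(bug["description"])
        hits += bool(set(retrieved) & set(bug["patch_list"]))
    return hits / len(bugs)
```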
We have not included stemming and stop-word removal in our
preprocessing because these are active research areas in
their own right. There are a number of choices available
today for stemming; see, for example, the following
publication: On
the Use of Stemming for Concern Location and Bug
Localization in Java. Similarly, the stop words tend
to be domain and application specific. We thought it would
be best if we gave the researchers the option to use their
own favorite stemming algorithm and customized stop-word
list.
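One way to leave these choices open, as moreBugs does, is to make the stemmer and the stop-word list pluggable parameters of the tokenizer. The sketch below is purely illustrative; the naive suffix-stripping "stemmer" is a placeholder for a real algorithm such as Porter's:

```python
def tokenize(text, stem=lambda w: w, stop_words=frozenset()):
    """Tokenize with a caller-supplied stemmer and stop-word list,
    mirroring the choice moreBugs leaves to the researcher."""
    return [stem(w) for w in text.lower().split() if w not in stop_words]

# e.g. with a trivial suffix-stripping "stemmer" and a tiny stop list
# (both hypothetical; substitute your own favorites):
naive_stem = lambda w: w[:-3] if w.endswith("ing") else w
print(tokenize("fixing the rendering bug", stem=naive_stem, stop_words={"the"}))
# → ['fix', 'render', 'bug']
```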
@techreport{moreBugs,
  title       = {{moreBugs}: A {N}ew {D}ataset for
                 {B}enchmarking {A}lgorithms for
                 {I}nformation {R}etrieval from
                 {S}oftware {R}epositories},
  author      = {Shivani Rao and Avinash Kak},
  institution = {Purdue University,
                 School of Electrical and
                 Computer Engineering},
  number      = {TR-ECE-13-07},
  year        = {2013},
  month       = {04},
}
In both the sample and the full dataset that we have provided, you will find that the original source files are missing; this is done to avoid duplication and save space. However, if you need the original source files, we provide scripts with which you can extract the source files corresponding to a revision, a bug, or a tag. To run these scripts, you will need to install R.
Step 1: Clone the repository into your home directory