BUGLinks - Bug Reports and Their Links to Software Repositories
Methodology
Information Retrieval (IR) based approaches to Bug Localization (BL) constitute an important research area in software development and maintenance, and many researchers have presented promising results in this area in recent years. To evaluate IR-based BL algorithms, we need a set of bug reports together with the files modified to fix the corresponding bugs; these files serve as the ground truth that should be retrieved in response to each bug report. Unfortunately, bug tracking databases such as Bugzilla do not store the actual modifications committed for the reported bugs. There is therefore an increasing need for large datasets that reconstruct the links between software repositories and bug tracking databases, so that IR-based BL algorithms can be evaluated efficiently and convincingly. This dataset was prepared in the hope that it will help advance this area of research by providing a large number of bug reports and the corresponding modifications in the software repositories of well-known, commonly used open-source software projects.
The common approach to linking modifications to bug reports is to look for pointers to the bug tracking database in the commit messages. For the Eclipse project, we follow this approach and employ regular expressions to reconstruct the links between the bug reports and the corresponding modifications accurately, as follows:
· Scan the repository logs and group the files that are modified by the same author with the same commit message within a time fuzziness of 200 seconds. This step is necessary because CVS repositories store the changes made to each file separately.
· For the target version of the software, use regular expressions to extract the bug IDs from the commit messages. Our regular expressions match the following generic phrases in the commit messages: Fix for ID, Fix ID, Fixed ID, Fixing ID, and Bug ID. They are insensitive to spacing and to punctuation characters such as ":" or "#" within the identified phrase, i.e., phrases such as BugID, Bug: ID, or Bug #ID are also matched.
· Check whether the extracted ID exists in the bug tracking database for the target version of the software.
We only consider the bug reports that are marked "FIXED" in the Bugzilla database, including "VERIFIED FIXED" and "RESOLVED FIXED".
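The three steps above can be sketched in Python as follows. This is only an illustrative sketch, not the code used to build the dataset: the `LogEntry` record, the function names, and the exact regular expression are assumptions, and the 200-second window is measured here from the previous file change in the same group, which is one possible reading of "time fuzziness".

```python
import re
from collections import namedtuple

# Hypothetical record for one per-file CVS log entry (timestamp in seconds).
LogEntry = namedtuple("LogEntry", "author message timestamp path")

FUZZ_SECONDS = 200  # time fuzziness stated in the text

# Assumed pattern for "Fix for ID", "Fix ID", "Fixed ID", "Fixing ID", "Bug ID",
# tolerating optional whitespace, ":" and "#" before the number.
BUG_ID = re.compile(r"\b(?:fix(?:ed|ing)?(?:\s+for)?|bug)[\s:#]*(\d+)", re.IGNORECASE)


def group_commits(entries, fuzz=FUZZ_SECONDS):
    """Step 1: merge per-file entries with the same author and message
    whose timestamps are at most `fuzz` seconds apart."""
    groups = []
    ordered = sorted(entries, key=lambda e: (e.author, e.message, e.timestamp))
    for e in ordered:
        last = groups[-1] if groups else None
        if (last and last["author"] == e.author
                and last["message"] == e.message
                and e.timestamp - last["last_seen"] <= fuzz):
            last["files"].add(e.path)
            last["last_seen"] = e.timestamp
        else:
            groups.append({"author": e.author, "message": e.message,
                           "files": {e.path}, "last_seen": e.timestamp})
    return groups


def extract_links(groups, fixed_bug_ids):
    """Steps 2 and 3: pull bug IDs out of each commit message and keep only
    those present in the set of bugs marked FIXED in the tracker."""
    links = []
    for g in groups:
        ids = {int(m.group(1)) for m in BUG_ID.finditer(g["message"])}
        for bug_id in sorted(ids):
            if bug_id in fixed_bug_ids:
                links.append((bug_id, sorted(g["files"])))
    return links


entries = [
    LogEntry("alice", "Fix for 101: NPE", 1000, "A.java"),
    LogEntry("alice", "Fix for 101: NPE", 1150, "B.java"),
    LogEntry("alice", "Fix for 101: NPE", 2000, "C.java"),
    LogEntry("bob", "Bug #202 cleanup", 1100, "D.java"),
]
links = extract_links(group_commits(entries), fixed_bug_ids={101})
```

In this toy input, the first two entries fall into one commit, the third is too far apart in time and forms its own commit, and bug 202 is dropped because it is not in the assumed set of fixed bugs.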
For Google Chrome, the commit messages follow a more specific form: they contain a separate line that indicates whether the commit fixes any bugs and, if so, lists the bug IDs.
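A minimal sketch of reading such a line is shown below. The text does not name the line's format; the `BUG=` prefix used here is an assumption based on the convention common in Chromium commit messages of that period, and the function name is hypothetical.

```python
import re

# Assumed format: a line of the form "BUG=<ids or 'none'>".
# [ \t]* is used instead of \s* so the match never spills onto the next line.
BUG_LINE = re.compile(r"^[ \t]*BUG[ \t]*=[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)


def chrome_bug_ids(message):
    """Return the numeric bug IDs listed on the BUG= line(s) of a commit message."""
    ids = []
    for rest_of_line in BUG_LINE.findall(message):
        ids.extend(int(tok) for tok in re.findall(r"\d+", rest_of_line))
    return ids


fixing = "Fix crash in tab strip.\n\nBUG=1234, 5678\nTEST=none"
non_fixing = "Refactor helpers.\nBUG=none\nTEST=unit 42"
```

Because only the dedicated line is inspected, numbers elsewhere in the message (such as the `TEST=` line above) are not mistaken for bug IDs.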
Note that a bug may need more than one set of modifications to be finally resolved. Therefore, we accumulate the set of modifications that are committed for the same bug over the analysis period.
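The accumulation can be illustrated with a short sketch. It assumes, hypothetically, that the linking step yields one `(bug_id, files)` pair per commit; the function name and data layout are not part of the dataset.

```python
from collections import defaultdict


def accumulate(links):
    """Merge all commits linked to the same bug.

    `links` is an iterable of (bug_id, files) pairs, one per commit.
    Returns the number of revisions per bug and the union of fixed files per bug.
    """
    revisions = defaultdict(int)
    files = defaultdict(set)
    for bug_id, changed in links:
        revisions[bug_id] += 1
        files[bug_id].update(changed)
    return dict(revisions), {bug: sorted(paths) for bug, paths in files.items()}


links = [
    (101, ["A.java", "B.java"]),
    (202, ["D.java"]),
    (101, ["B.java", "C.java"]),  # a later follow-up fix for bug 101
]
revisions, fixed_files = accumulate(links)
```

Here bug 101 ends up with two revisions and three distinct fixed files, since `B.java` is counted once even though both commits touched it.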
Dataset Statistics
Project | Language | #Bugs | Analysis Period | Average #Revisions per Bug | Average #Fixed Files per Bug
Eclipse v3.1 | Java | 4,650 | 2001-04-28 - 2010-05-21 | 6,331 / 4,650 = 1.36 | 18,242 / 4,650 = 3.92
Chrome v4.0 | C/C++ | 396 | 2008-07-25 - 2010-05-20 | 490 / 396 = 1.24 | 1,903 / 396 = 4.81
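The per-bug averages are simply the total number of revisions and the total number of fixed files divided by the number of bugs. The short sketch below recomputes them from the totals in the table; the variable and function names are illustrative only.

```python
# Totals copied from the table: (bugs, revisions, fixed files).
TOTALS = {
    "Eclipse v3.1": (4650, 6331, 18242),
    "Chrome v4.0": (396, 490, 1903),
}


def per_bug_averages(bugs, revisions, fixed_files):
    """Return (average revisions per bug, average fixed files per bug)."""
    return revisions / bugs, fixed_files / bugs


for project, counts in TOTALS.items():
    rev_avg, file_avg = per_bug_averages(*counts)
    print(f"{project}: {rev_avg:.2f} revisions/bug, {file_avg:.2f} fixed files/bug")
```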
Note that for some bugs there are no files fixed in the source code of the evaluated project. This is because those bugs are either reported on a test package or they require changes in only non-executables such as XML files. We do not use those bug reports in the experiments presented in our paper. However, we provide all the bugs here as extracted by our regular expression-based algorithm as some people may find those bug reports useful in their studies.
Figure 1: An excerpt from the Dataset for Eclipse
Download
If you use BUGLinks during the course of your research, please cite:
B. Sisman and A. Kak, “Assisting Code Search with Automatic Query Reformulation for Bug Localization,” in Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), 2013.
The contents of this dataset are subject to the GNU GPL v3 (“the License”). You may not use this dataset except in compliance with the License. The dataset distributed under the License is distributed on an “AS IS” basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.
Contact Us
If you have any questions or suggestions on improving the usability of the dataset, please contact Bunyamin Sisman.
Acknowledgement
This research is funded by Infosys.