BUGLinks - Bug Reports and Their Links to Software Repositories



Information Retrieval (IR) based approaches to Bug Localization (BL) constitute an important research area in software development and maintenance, and many researchers have presented promising results in this area in recent years. To evaluate IR-based algorithms for BL, we need a set of bug reports together with the files modified to fix the corresponding bugs; these files serve as the ground truth that should be retrieved in response to each bug. Unfortunately, bug tracking databases such as Bugzilla do not store the actual modifications committed for the reported bugs. There is therefore an increasing need for large datasets that reconstruct the links between the software repositories and the bug tracking databases, so that IR-based BL algorithms can be evaluated efficiently and convincingly. This dataset was prepared in the hope of advancing this area of research by providing a large number of bug reports and the corresponding modifications in the software repositories of well-known, commonly used open-source software projects.


The common approach to link the modifications to the bug reports is to look for pointers in the commit messages to the bug tracking database. For the Eclipse project, we follow this approach and employ regular expressions for an accurate reconstruction of the links between the bug reports and the corresponding modifications as follows:


·         Scanning the repository logs, group the files that are modified by the same author with the same commit message with a time fuzziness of 200 seconds. This step is necessary as CVS repositories store the changes made to each file separately.

·         For the target version of the software, use regular expressions to extract the bug IDs from the commit messages. Our regular expressions match the following generic phrases in the commit messages: Fix for ID, Fix ID, Fixed ID, Fixing ID, Bug ID. They are insensitive to spacing and to punctuation characters such as ":" or "#" within the identified phrase, i.e., phrases such as BugID, Bug: ID, or Bug #ID are also matched.

·         Check if the extracted ID exists in the bug tracking database for the target version of the software.
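The three steps above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the dataset: the `FileRevision` record, the function names, and the exact regular expression are assumptions, and the 200-second window is interpreted here as the maximum gap between consecutive revisions of the same commit.

```python
import re
from dataclasses import dataclass

# Hypothetical record for one per-file entry in a CVS log.
@dataclass
class FileRevision:
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch

FUZZ_SECONDS = 200

# Assumed pattern for "Fix for ID", "Fix ID", "Fixed ID", "Fixing ID", "Bug ID",
# tolerating optional whitespace, ':' or '#' before the number.
BUG_ID_RE = re.compile(
    r"\b(?:fix(?:ed|ing)?(?:\s+for)?|bug)[\s:#]*(\d+)", re.IGNORECASE
)

def group_commits(revisions, fuzz=FUZZ_SECONDS):
    """Step 1: group per-file revisions sharing author and message within `fuzz` seconds."""
    groups = []
    latest = {}  # (author, message) -> most recent group for that key
    for rev in sorted(revisions, key=lambda r: r.timestamp):
        key = (rev.author, rev.message)
        group = latest.get(key)
        if group is None or rev.timestamp - group["last"] > fuzz:
            group = {"author": rev.author, "message": rev.message,
                     "files": [], "last": rev.timestamp}
            groups.append(group)
            latest[key] = group
        group["files"].append(rev.path)
        group["last"] = rev.timestamp
    return groups

def extract_bug_ids(message, known_ids):
    """Steps 2 and 3: extract candidate IDs and keep those present in the bug database."""
    candidates = {int(m) for m in BUG_ID_RE.findall(message)}
    return candidates & set(known_ids)
```

Here `known_ids` stands in for the set of bug IDs recorded in the bug tracking database for the target version; checking against it discards numbers that merely look like bug IDs.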


We only consider the bug reports that are marked "FIXED" in the Bugzilla database, including "VERIFIED FIXED" and "RESOLVED FIXED".


For Google Chrome, the commit messages follow a more specific form. They contain a separate line to indicate whether the commit fixes any bugs and if so the bug IDs are given in that line.
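As a rough sketch of this case, the snippet below assumes the dedicated line has the form `BUG=<ids>` (for example `BUG=1234, 5678` or `BUG=none`). That line format and the function name are assumptions made for illustration; the page does not spell out the exact syntax.

```python
import re

# Assumed form of the dedicated line: "BUG=1234, 5678" or "BUG=none".
# [ \t]* is used instead of \s* so the match never spills onto the next line.
BUG_LINE_RE = re.compile(r"^[ \t]*BUG[ \t]*=[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)

def chrome_bug_ids(message):
    """Return the set of numeric bug IDs listed on BUG= lines of a commit message."""
    ids = set()
    for line in BUG_LINE_RE.findall(message):
        ids.update(int(token) for token in re.findall(r"\d+", line))
    return ids
```

Because the IDs sit on a dedicated line, no phrase matching is needed; numbers elsewhere in the message are ignored.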


Note that a bug may need more than one set of modifications to be finally resolved. Therefore, we accumulate the set of modifications that are committed for the same bug over the analysis period.
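A minimal sketch of this accumulation, assuming each linked commit is available as a pair of (bug IDs, modified files); the function name and input shape are hypothetical:

```python
from collections import defaultdict

def accumulate_fixes(commits):
    """Merge the files of every commit linked to the same bug.

    `commits` is an iterable of (bug_ids, files) pairs, one per linked commit.
    Returns (files_by_bug, commits_by_bug).
    """
    files_by_bug = defaultdict(set)
    commits_by_bug = defaultdict(int)
    for bug_ids, files in commits:
        for bug_id in bug_ids:
            files_by_bug[bug_id].update(files)  # union, so repeated files count once
            commits_by_bug[bug_id] += 1
    return files_by_bug, commits_by_bug
```

Using a set per bug means a file touched by several commits for the same bug is counted once in that bug's set of fixed files.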

Dataset Statistics




| Project | Analysis Period | Average #Revisions per Bug | Average #Fixed Files per Bug |
|---|---|---|---|
| Eclipse v3.1 | 2001-04-28 - 2010-05-21 | 6,331 / 4,650 = 1.36 | 18,242 / 4,650 = 3.92 |
| Chrome v4.0 | 2008-07-25 - 2010-05-20 | 490 / 396 = 1.23 | 1,903 / 396 = 4.81 |

Note that for some bugs no files were fixed in the source code of the evaluated project. This is because those bugs were either reported against a test package or required changes only in non-executable files such as XML files. We do not use those bug reports in the experiments presented in our paper. However, we provide all the bugs here as extracted by our regular expression-based algorithm, since some people may find those bug reports useful in their studies.


Figure 1: An excerpt from the Dataset for Eclipse



If you use BUGLinks during the course of your research, please cite:

B. Sisman and A. Kak, “Assisting Code Search with Automatic Query Reformulation for Bug Localization,” in Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), 2013.

Download BUGLinks

The contents of this dataset are subject to the GNU GPL v3 (“the License”). You may not use this dataset except in compliance with the License. The dataset distributed under the License is distributed on an “AS IS” basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.

Contact Us

If you have any questions or suggestions on improving the usability of the dataset, please contact Bunyamin Sisman.



This research is funded by Infosys.