How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset
MSR - Data Showcase
Program repair is an important but difficult software engineering problem. One way to achieve acceptable performance is to focus on classes of simple bugs, such as bugs with single-statement fixes or bugs that match a small set of bug templates. However, it is very difficult to estimate the recall of repair techniques for simple bugs, as there are no datasets that record how often the associated bugs occur in code. To fill this gap, we provide a dataset of 153,652 single-statement bug-fix changes mined from 1,000 popular open-source Java projects, annotated by whether they match any of a set of 16 bug templates inspired by state-of-the-art program repair techniques. We also maintain a repository of Maven dependencies for a subset of 100 projects that use the Maven build system. In an initial analysis, we find that about 33% of the simple bug fixes match the templates, indicating that a remarkable number of single-statement bugs can be repaired with a relatively small set of templates. Further, we find that template-fitting bugs appear with a frequency of about one bug per 1,600-2,500 lines of code (as measured by the size of the project’s latest version). We hope that this dataset will prove a useful resource both for future work in automatic program repair and for future studies in empirical software engineering.
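
To make the quoted figures concrete, the short Python sketch below works through the arithmetic they imply. It is a back-of-the-envelope illustration only, not part of the dataset or its tooling: the function names and the 100,000-line project are hypothetical, and because the abstract gives "about 33%" and a density range rather than exact values, the outputs are rough estimates. Note also that the density is measured against the size of a project's latest version, so the second estimate is best read as a rough count of template-fitting bugs to expect for a project of that size, not as the number of bugs currently present in it.

```python
# Figures quoted in the abstract; everything else here is an illustrative assumption.
TOTAL_SINGLE_STATEMENT_FIXES = 153_652
TEMPLATE_MATCH_RATE = 0.33          # "about 33%", so results are approximate
LOC_PER_BUG_RANGE = (1_600, 2_500)  # one template-fitting bug per this many lines


def approx_template_matches(total=TOTAL_SINGLE_STATEMENT_FIXES,
                            rate=TEMPLATE_MATCH_RATE):
    """Approximate number of bug fixes that match at least one template."""
    return round(total * rate)


def expected_bug_range(project_loc, loc_per_bug=LOC_PER_BUG_RANGE):
    """Rough (low, high) count of template-fitting bugs for a given project size."""
    dense, sparse = loc_per_bug
    # The sparser density (more lines per bug) gives the lower estimate.
    return project_loc / sparse, project_loc / dense


if __name__ == "__main__":
    print(approx_template_matches())
    low, high = expected_bug_range(100_000)  # hypothetical 100,000-line project
    print(f"between {low:.1f} and {high:.1f} template-fitting bugs")
```

Under these assumptions, roughly 50,000 of the mined changes match a template, and a hypothetical 100,000-line project would be associated with a few dozen template-fitting bugs.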