Forking Without Clicking: on How to Identify Software Repository Forks
MSR - Technical Paper
The notion of software “fork” has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single software product without stepping on each other's toes in the interim. Either way, VCS repositories involved in a fork share parts of a common development history.
Historically, studies of software forks have relied on hosting platform metadata, most notably GitHub's, as the source of truth for what constitutes a fork. However, this notion of “forge fork” can only identify as forks those repositories that were created on the platform, e.g., by clicking the “fork” button in the GitHub user interface. The growing popularity of more distributed code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., that of the Linux kernel, which is not primarily hosted on any single centralized platform) call into question the reliability of trusting hosting platforms to determine what a fork is. Doing so might introduce selection and methodological biases in empirical studies.
In this article we explore several definitions of “software forks”, trying to capture the forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to each definition, confirming that a significant number would be overlooked when only considering “forge forks”. We study the structure of fork networks, observing how their size is affected by the proposed definitions, and discuss the potential impacts of these results on empirical research.