Forking Without Clicking: on How to Identify Software Repository Forks
MSR - Technical Paper
The notion of software “fork” has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single software product without stepping on each other's toes in the interim. Either way, VCS repositories involved in a fork share parts of a common development history.
Historically, studies of software forks have relied on hosting platform metadata, most notably from GitHub, as the source of truth for what constitutes a fork. However, these “forge forks” can only identify as forks repositories that have been created on the platform, e.g., by clicking the “fork” button on the GitHub user interface. The increase in popularity of more distributed code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel community, which is not primarily hosted on any single centralized platform) call into question the reliability of trusting hosting platforms to determine what a fork is. Doing so might introduce selection and methodological biases in empirical studies.
In this article we explore various definitions of “software forks”, trying to capture the various forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number would be overlooked when only considering “forge forks”. We study the structure of fork networks, observing how their size is affected by the proposed definitions, and discuss the potential impacts of these results on empirical research.
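The idea that forked repositories "share parts of a common development history" can be illustrated with a minimal sketch. It assumes, purely for illustration, that each repository is represented as a set of commit identifiers and that two repositories belong to the same fork network when they share at least one commit, directly or transitively. This criterion, the function name `fork_networks`, and the sample data are hypothetical and are not necessarily the definitions proposed in the paper.

```python
from collections import defaultdict


def fork_networks(repos):
    """Group repositories that share at least one commit ID (transitively).

    `repos` maps a repository name to the set of commit IDs it contains.
    Returns a list of sets of repository names, one set per network.
    The "one shared commit" criterion is an illustrative assumption.
    """
    # Union-find structure: every repository starts in its own group.
    parent = {name: name for name in repos}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Remember the first repository seen for each commit; any later
    # repository containing the same commit is merged into its group.
    first_seen = {}
    for name, commits in repos.items():
        for commit in commits:
            if commit in first_seen:
                union(first_seen[commit], name)
            else:
                first_seen[commit] = name

    groups = defaultdict(set)
    for name in repos:
        groups[find(name)].add(name)
    return list(groups.values())


# Hypothetical data: "pushed_copy" was never created with a fork button,
# but it shares commit "c4" with "forge_fork".
repos = {
    "origin": {"c1", "c2", "c3"},
    "forge_fork": {"c1", "c2", "c4"},
    "pushed_copy": {"c4", "c5"},
    "unrelated": {"x1"},
}
networks = fork_networks(repos)
```

In this toy data, `pushed_copy` ends up in the same network as `origin` even though no platform metadata links them, which is the kind of relationship a metadata-only approach would miss.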
Tue 30 Jun
Times are displayed in time zone: (UTC) Coordinated Universal Time
|10:30 - 10:37|
Jens Meinicke (Carnegie Mellon University), Juan Hoyos (Universidad Nacional de Colombia), Bogdan Vasilescu (Carnegie Mellon University), Christian Kästner (Carnegie Mellon University)
|10:37 - 10:45|
Antoine Pietri (Inria), Guillaume Rousseau (Université de Paris and Inria), Stefano Zacchiroli (Université de Paris and Inria)
|10:45 - 10:52|
|10:52 - 11:00|
Employing Contribution and Quality Metrics for Quantifying the Software Development Process
MSR - Data Showcase
Themistoklis Diamantopoulos (Electrical and Computer Engineering Dept, Aristotle University of Thessaloniki), Michail Papamichail, Thomas Karanikiotis, Kyriakos Chatzidimitriou (Aristotle University of Thessaloniki), Andreas Symeonidis (Aristotle University of Thessaloniki)