Forking Without Clicking: on How to Identify Software Repository Forks (MSR 2020 - Technical Papers)

Who

Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli

Track

MSR 2020 Technical Papers

When

Tue 30 Jun 2020 10:37 - 10:45 (UTC) at MSR:Zoom - Evolution. Chair(s): Jürgen Cito

Abstract

The notion of software “fork” has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single software product without stepping on each other's toes in the interim. Either way, VCS repositories involved in a fork share parts of a common development history.

Historically, studies of software forks have relied on hosting platform metadata, most notably GitHub's, as the source of truth for what constitutes a fork. However, this notion of “forge forks” only identifies as forks those repositories that were created on the platform itself, e.g., by clicking the “fork” button in the GitHub user interface. The increase in popularity of more distributed code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel community, which is not primarily hosted on any single centralized platform) call into question the reliability of trusting hosting platforms to determine what a fork is. Doing so might introduce selection and methodological biases in empirical studies.

In this article we explore various definitions of “software forks”, trying to capture the various forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number would be overlooked when only considering “forge forks”. We study the structure of fork networks, observing how their size is affected by the proposed definitions, and discuss the potential impacts of these results on empirical research.
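
As a rough illustration of a history-based alternative to “forge forks”, the sketch below groups repositories into fork networks whenever they share at least one commit, regardless of where they are hosted. This is only one possible reading of "sharing parts of a common development history", assumed here for illustration and not necessarily one of the definitions used in the paper. The function name `fork_networks`, the repository names, and the commit IDs are all hypothetical.

```python
from collections import defaultdict


def fork_networks(repo_commits):
    """Group repositories that share at least one commit (assumed criterion).

    repo_commits: dict mapping a repository name to a set of commit IDs.
    Returns a list of sets of repository names, one set per network.
    """
    # Union-find structure: every repository starts as its own representative.
    parent = {repo: repo for repo in repo_commits}

    def find(repo):
        # Walk up to the representative, shortening the path along the way.
        while parent[repo] != repo:
            parent[repo] = parent[parent[repo]]
            repo = parent[repo]
        return repo

    first_seen = {}  # commit ID -> first repository in which it was seen
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_seen:
                # A shared commit links the two repositories' networks.
                parent[find(repo)] = find(first_seen[commit])
            else:
                first_seen[commit] = repo

    networks = defaultdict(set)
    for repo in repo_commits:
        networks[find(repo)].add(repo)
    return list(networks.values())


# Hypothetical data: carol's copy lives on another host, so forge metadata
# alone would not list it as a fork, but its history overlaps with alice's.
repos = {
    "forge/alice/project": {"c1", "c2", "c3"},
    "forge/bob/project": {"c1", "c2", "c4"},
    "elsewhere/carol/project": {"c1", "c5"},
    "forge/dave/unrelated": {"d1", "d2"},
}
networks = fork_networks(repos)
sizes = sorted(len(network) for network in networks)
```

With this toy input, the three repositories that contain commit `c1` end up in one network and the unrelated repository forms a network of its own. A stricter or looser criterion (for example, requiring a shared initial commit) would change which repositories are linked, and therefore the size of each network.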

Link to Preprint

https://hal.inria.fr/hal-02527811/document

Antoine Pietri

Inria

France

Guillaume Rousseau

Université de Paris and Inria

Stefano Zacchiroli