A Dataset for GitHub Repository Deduplication (MSR 2020 - Data Showcase) - MSR 2020

Mon 29 - Tue 30 June 2020

co-located with ICSE 2020

Who

Diomidis Spinellis, Zoe Kotti, Audris Mockus

Track

MSR 2020 Data Showcase

Time Zone

The program is currently displayed in (UTC) Coordinated Universal Time.

When

Mon 29 Jun 2020 17:04 - 17:12 at MSR:Zoom - Github & OSS Datasets Chair(s): Olga Baysal

Abstract

GitHub projects can be easily replicated through the site’s fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project’s ultimate parent. The ultimate parents were derived from a ranking along six metrics. The project cliques were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
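The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a small list of copy relationships between projects, groups them into connected components with a union-find structure, and picks one "ultimate parent" per component. The abstract mentions a ranking along six metrics but does not name them here, so the single `score` value, the function name `cliques_with_ultimate_parent`, and the project names are all hypothetical stand-ins.

```python
from collections import defaultdict


def find(parent, x):
    """Return the union-find root of x, compressing the path on the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def cliques_with_ultimate_parent(edges, score):
    """Map every copied project to the top-ranked project of its component.

    edges: iterable of (project_a, project_b) pairs marking the two as copies.
    score: dict from project name to a hypothetical ranking score
           (a stand-in for the six metrics, which are not listed here).
    """
    parent = {}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    groups = defaultdict(list)
    for node in parent:
        groups[find(parent, node)].append(node)

    result = {}
    for members in groups.values():
        # Highest score wins; ties are broken by name for determinism.
        ultimate = max(members, key=lambda p: (score.get(p, 0), p))
        for member in members:
            if member != ultimate:
                result[member] = ultimate
    return result


# Hypothetical example data: two unrelated groups of copied projects.
edges = [
    ("alice/tool", "bob/tool"),
    ("bob/tool", "carol/tool"),
    ("x/lib", "y/lib"),
]
score = {"alice/tool": 120, "bob/tool": 3, "carol/tool": 1, "x/lib": 5, "y/lib": 9}
mapping = cliques_with_ultimate_parent(edges, score)
```

A mapping of this shape is what makes deduplication straightforward: any project that appears as a key can be dropped from a sample, or replaced by its value, before computing statistics or training a model.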

Link to Preprint

https://doi.org/10.5281/zenodo.3740595

DOI

https://doi.org/10.1145/3379597.3387496

Diomidis Spinellis (Author)

Athens University of Economics and Business

Greece

Zoe Kotti (Author)

Athens University of Economics and Business

Greece

Audris Mockus (Author)

Session Program

Mon 29 Jun
Displayed time zone: (UTC) Coordinated Universal Time

	16:30 - 17:30	Github & OSS Datasets at MSR:Zoom
	Tracks: Registered Reports / Keynote / MSR Awards / FOSS Award / Education / Data Showcase / Mining Challenge / MSR Challenge Proposals / Ask Me Anything / Technical Papers
	Chair(s): Olga Baysal, Carleton University
	Q/A & Discussion of Session Papers over Zoom (Joining info available on Slack)

	16:30	8m	Live Q&A	A New Dataset for Pull Request Acceptance (MSR - Data Showcase)
		Authors: Xunhui Zhang (National University of Defense Technology, China); Ayushi Rastogi (University of Groningen, The Netherlands); Yue Yu (College of Computer, National University of Defense Technology, Changsha 410073, China)
	16:38	8m	Live Q&A	A Mixed Graph-Relational Dataset of Socio-technical Interactions in Open Source Systems (MSR - Data Showcase)
		Authors: Usman Ashraf; Christoph Mayr-Dorn (Johannes Kepler University Linz); Alexander Egyed (Johannes Kepler University, Linz); Sebastiano Panichella
	16:47	8m	Live Q&A	A Complete Set of Related Git Repositories Identified via Community Detection Approaches Based on Shared Commits (MSR - Data Showcase)
		Authors: Audris Mockus; Zoe Kotti (Athens University of Economics and Business); Diomidis Spinellis (Athens University of Economics and Business); Gabriel Dusing
	16:55	8m	Live Q&A	A Dataset of Enterprise-Driven Open Source Software (MSR - Data Showcase)
		Authors: Diomidis Spinellis (Athens University of Economics and Business); Zoe Kotti (Athens University of Economics and Business); Konstantinos Kravvaritis; Georgios Theodorou; Panos Louridas (Athens University of Economics and Business)
	17:04	8m	Live Q&A	A Dataset for GitHub Repository Deduplication (MSR - Data Showcase)
		Authors: Diomidis Spinellis (Athens University of Economics and Business); Zoe Kotti (Athens University of Economics and Business); Audris Mockus
	17:12	8m	Live Q&A	A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits (MSR - Data Showcase)
		Authors: Tanner Fry; Tapajit Dey; Andrey Karnauch (University of Tennessee Knoxville); Audris Mockus
	17:21	8m	Live Q&A	20-MAD - 20 years of issues and commits of Mozilla and Apache Development (MSR - Data Showcase)
		Authors: Maëlick Claes (University of Oulu); Mika Mäntylä (University of Oulu)