MSR 2020
Mon 29 - Tue 30 June 2020
co-located with ICSE 2020
Mon 29 Jun 2020 16:47 - 16:55 at MSR:Zoom - Github & OSS Datasets Chair(s): Olga Baysal

To understand the state and evolution of the entirety of open source software, we need to get a handle on the set of distinct software projects. Most open source projects presently use Git, a distributed version control system that makes creating clones easy and results in numerous repositories that are almost entirely based on the parent repository from which they were cloned. Git commits are based on a Merkle tree, so the same commit is highly unlikely to be produced independently in two repositories. Shared commits therefore appear to be an excellent way to group cloned repositories and obtain an accurate map of such repositories. We use the World of Code infrastructure, containing approximately 2B commits and 100M repositories, to create and share such a map. We discover that the largest group contains almost 14M repositories, most of which are unrelated to each other. As it turns out, developers can push Git objects to an arbitrary repository or pull objects from unrelated repositories, thus linking unrelated repositories. To address this, we apply the Louvain community detection algorithm to this very large graph consisting of links between commits and projects. The approach successfully reduces the size of the megacluster: the largest group of highly interconnected projects now contains under 100K repositories. We expect that the resulting map of related projects, as well as the tools and methods for handling the very large graph, will serve as a reference set for mining software projects and other applications. Further work is needed to determine the different types of relationships among projects induced by shared commits and by other relationships, for example shared source code or similar filenames.
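The grouping idea in the abstract can be illustrated with a minimal sketch. It assumes a toy in-memory mapping from repository names to sets of commit hashes; all names and hashes are invented for illustration and do not come from World of Code. The sketch only computes groups of repositories that are transitively linked by shared commits, using a union-find structure. It does not implement the Louvain step. It does show how a single stray commit can merge otherwise unrelated clone groups, which is the megacluster effect the abstract describes.

```python
from collections import defaultdict


def group_by_shared_commits(repo_commits):
    """Group repositories that are transitively linked by a shared commit.

    repo_commits: dict mapping a repository name to a set of commit hashes
    (a hypothetical toy input, not the World of Code data format).
    """
    parent = {repo: repo for repo in repo_commits}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first repository seen for each commit; any later
    # repository containing the same commit is merged into its group.
    first_repo_with = {}
    for repo, commits in repo_commits.items():
        for commit in commits:
            if commit in first_repo_with:
                union(first_repo_with[commit], repo)
            else:
                first_repo_with[commit] = repo

    groups = defaultdict(set)
    for repo in repo_commits:
        groups[find(repo)].add(repo)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Two independent clone families plus one unrelated repository.
toy = {
    "alice/app": {"c1", "c2", "c3"},
    "bob/app-fork": {"c1", "c2", "c4"},
    "carol/lib": {"d1", "d2"},
    "dave/lib-fork": {"d1", "d2", "d3"},
    "eve/misc": {"e1"},
}
groups = group_by_shared_commits(toy)

# A single commit pushed into an unrelated repository links both families.
toy_linked = dict(toy)
toy_linked["bob/app-fork"] = toy["bob/app-fork"] | {"d3"}
merged = group_by_shared_commits(toy_linked)

print(groups)
print(merged)
```

In the first call the two clone families stay separate; in the second, one foreign commit is enough to collapse them into one group. Splitting such accidental merges is the role the abstract assigns to community detection on the commit-project graph.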

Mon 29 Jun
Times are displayed in time zone: (UTC) Coordinated Universal Time

16:30 - 17:30: Technical Papers - Github & OSS Datasets at MSR:Zoom
Chair(s): Olga Baysal (Carleton University)

Q/A & Discussion of Session Papers over Zoom (Joining info available on Slack)

16:30 - 16:38 (Data Showcase)
Live Q&A
Xunhui Zhang (National University of Defense Technology, China), Ayushi Rastogi (Postdoctoral researcher at TU Delft), Yue Yu (College of Computer, National University of Defense Technology, Changsha 410073, China)
16:38 - 16:47 (Data Showcase)
Live Q&A
Usman Ashraf, Christoph Mayr-Dorn (Johannes Kepler University Linz), Alexander Egyed (Johannes Kepler University Linz), Sebastiano Panichella
16:47 - 16:55 (Data Showcase)
Live Q&A
Audris Mockus, Zoe Kotti (Athens University of Economics and Business), Diomidis Spinellis (Athens University of Economics and Business), Gabriel Dusing
16:55 - 17:04 (Data Showcase)
Live Q&A
Diomidis Spinellis (Athens University of Economics and Business), Zoe Kotti (Athens University of Economics and Business), Konstantinos Kravvaritis, Georgios Theodorou, Panos Louridas (Athens University of Economics and Business)
17:04 - 17:12 (Data Showcase)
Live Q&A
Diomidis Spinellis (Athens University of Economics and Business), Zoe Kotti (Athens University of Economics and Business), Audris Mockus
17:12 - 17:21 (Data Showcase)
Live Q&A
Tanner Fry, Tapajit Dey, Andrey Karnauch (University of Tennessee Knoxville), Audris Mockus
17:21 - 17:30 (Data Showcase)
Live Q&A
Maëlick Claes (University of Oulu), Mika Mäntylä (University of Oulu)