The International Conference on Mining Software Repositories (MSR) has hosted a mining challenge since 2006. With this challenge, we call upon everyone interested to apply their tools to a common dataset: the challenge dares researchers and practitioners to put their mining tools and approaches to the test.
Mon 29 Jun (all times in UTC)

11:00 - 12:00 | Build, CI, & Dependencies — at MSR:Zoom. Chair: Raula Gaikovina Kula (NAIST). Q&A and discussion of session papers over Zoom (joining info available on Slack).
- 11:00 (12m, live Q&A): A Tale of Docker Build Failures: A Preliminary Study. Technical Papers. Yiwen Wu, Yang Zhang, Tao Wang, Huaimin Wang (National University of Defense Technology, China). Pre-print available.
- 11:12 (12m, live Q&A): Using Others' Tests to Avoid Breaking Updates. Technical Papers. Suhaib Mujahid, Rabe Abdalkareem, Emad Shihab (Concordia University), Shane McIntosh (McGill University). Pre-print available.
- 11:24 (12m, live Q&A): A Dataset of Dockerfiles. Data Showcase. Jordan Henkel (University of Wisconsin–Madison), Christian Bird, Shuvendu K. Lahiri (Microsoft Research), Thomas Reps (University of Wisconsin–Madison).
- 11:36 (12m, live Q&A): Empirical Study of Restarted and Flaky Builds on Travis CI. Technical Papers. Thomas Durieux (KTH Royal Institute of Technology), Claire Le Goues, Michael Hilton (Carnegie Mellon University), Rui Abreu (Instituto Superior Técnico, U. Lisboa & INESC-ID). Pre-print available.
- 11:48 (12m, live Q&A): LogChunks: A Data Set for Build Log Analysis. Data Showcase. Carolin Brandt, Annibale Panichella, Andy Zaidman (Delft University of Technology), Moritz Beller (Facebook). Pre-print available.

13:00 - 13:15 | Opening & Awards — MSR Plenary at MSR:Zoom. Chairs: Georgios Gousios (Delft University of Technology), Sunghun Kim (Hong Kong University of Science and Technology), Sarah Nadi (University of Alberta). Live on YouTube: https://www.youtube.com/watch?v=Qvf7mHa-YYs
- 13:00 (15m): MSR Opening & Awards. Sunghun Kim, Sarah Nadi, Georgios Gousios.

14:30 - 15:00 | Tutorial 1: GDPR Considerations — at MSR:Zoom2. Chairs: Abram Hindle (University of Alberta), Alexander Serebrenik (Eindhoven University of Technology). Q&A for the tutorial over Zoom (joining info available on Slack).
- 14:30 (30m, tutorial): Mining Software Repositories While Respecting Privacy. Education. Pre-print available.
Tue 30 Jun (all times in UTC)

11:00 - 12:00 | Security — at MSR:Zoom2. Chair: Dimitris Mitropoulos (Athens University of Economics and Business). Q&A and discussion of session papers over Zoom (joining info available on Slack).
- 11:00 (12m, live Q&A): Did You Remember To Test Your Tokens? Technical Papers. Danielle Gonzalez (Rochester Institute of Technology), Michael Rath (Technische Universität Ilmenau), Mehdi Mirakhorli (Rochester Institute of Technology). Pre-print available.
- 11:12 (12m, live Q&A): Automatically Granted Permissions in Android apps. Technical Papers. Paolo Calciati (IMDEA Software Institute), Konstantin Kuznetsov (Saarland University, CISPA), Alessandra Gorla (IMDEA Software Institute), Andreas Zeller (CISPA Helmholtz Center for Information Security).
- 11:24 (12m, live Q&A): PUMiner: Mining Security Posts from Developer Question and Answer Websites with PU Learning. Technical Papers. Triet Le, David Hin, Roland Croft, Muhammad Ali Babar (The University of Adelaide). Pre-print available.
- 11:36 (12m, live Q&A): A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries. Data Showcase. Jiahao Fan, Yi Li, Shaohua Wang (New Jersey Institute of Technology), Tien N. Nguyen (University of Texas at Dallas).
- 11:48 (12m, live Q&A): The Impact of a Major Security Event on an Open Source Project: The Case of OpenSSL. Technical Papers. James Walden (Northern Kentucky University). Pre-print available.
Call for Papers
This year, the challenge is about mining the Software Heritage Graph Dataset, a very large dataset containing the development history of publicly available software at the granularity used by state-of-the-art distributed version control systems. The included software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, npm) and are stored in a uniform representation: a fully-deduplicated Merkle DAG linking together source code files organized in directories, commits tracking evolution over time, up to full snapshots of version control system (VCS) repositories as observed by Software Heritage during periodic crawls.
Analyses can be based on the Software Heritage Graph Dataset alone, or expanded to include data from other resources such as GHTorrent, the Ultimate Debian Database, or any other dataset describing software artifacts that appear in the dataset (e.g., previous studies of npm, PyPI, etc.). Note that the dataset does not contain the source code files themselves; it refers to them using persistent identifiers, which can be used to cross-reference source code files mentioned in previous studies and datasets, or even to retrieve the source code of interest from Software Heritage.
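To make the identifier scheme concrete, here is a minimal Python sketch (ours, for illustration; not part of the dataset tooling) that computes the persistent identifier of a file content. Content identifiers reuse git's salted SHA1 over a "blob" header plus the raw bytes, which is what makes cross-referencing with git-based datasets straightforward; identifiers for directories, revisions, and releases are derived analogously from their git object encodings (consult the Software Heritage identifier specification for the authoritative definition).

```python
import hashlib

def swhid_for_content(data: bytes) -> str:
    """Compute the Software Heritage persistent identifier of a file content.

    Content identifiers use the same salted SHA1 as git blob objects,
    i.e., sha1(b"blob <length>\\0" + data), rendered as swh:1:cnt:<hex>.
    """
    header = b"blob %d\x00" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()

# Example: this matches the well-known git blob hash of "hello world\n".
print(swhid_for_content(b"hello world\n"))
# -> swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Such an identifier can then be used to look up the corresponding row in the dataset, or to fetch the actual bytes from the Software Heritage archive through its public API.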
The overall goal is to study public software development, expanding the scope of analysis of previous studies to a novel scale thanks to: (1) a good approximation of the entire corpus of publicly available software, (2) blending together related development histories in a single graph, and (3) abstracting over VCS and package differences, offering a canonical representation of source code artifacts.
Questions that are, to the best of our knowledge, not yet sufficiently answered and that could be addressed using this year's dataset include:
- Scale: Can previous software mining results be reproduced when looking at all the projects of a given kind rather than the “most starred”? At what point is sampling sufficient?
- Cross-repository analysis: How can forking and duplication patterns inform us about software health and risks? How can community forks be distinguished from personal-use forks? What are good predictors of the success of a community fork?
- Cross-origin analysis: Is software evolution consistent across different version control systems? Are there VCS-specific development patterns? How does a migration from one VCS to another affect development patterns? Is there a relationship between development cycles and package manager releases?
- Graph structure: How tightly coupled are the different layers of the graph? What is the deduplication efficiency across different programming languages? When and where do source code files or directories tend to be reused? How is code shared between different forges?
These are just some of the questions that could be answered using the Software Heritage Graph Dataset. We encourage challenge participants to adapt the above research questions or formulate their own about any hidden knowledge that still defeats discovery in the treasure trove of our collective software commons!
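As a concrete starting point, below is a minimal sketch, assuming the relational export of the dataset has been loaded into a local PostgreSQL instance (the connection string is a placeholder) and assuming a `revision` table with an author `date` column as described in the dataset documentation; verify table and column names against the schema of the dataset version you download. It counts commits per calendar year across the whole corpus, a first step toward the scale questions above.

```python
import psycopg2  # assumes: pip install psycopg2-binary

# Hypothetical connection string; adjust to wherever you loaded the export.
CONN_INFO = "dbname=swh host=localhost"

# Commits per calendar year over the whole corpus. The `revision` table and
# its `date` column follow the dataset documentation; double-check them
# against the schema of the dataset version you are actually using.
QUERY = """
    SELECT EXTRACT(YEAR FROM date)::int AS year, COUNT(*) AS commits
    FROM revision
    WHERE date IS NOT NULL
    GROUP BY year
    ORDER BY year
"""

with psycopg2.connect(CONN_INFO) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for year, commits in cur.fetchall():
            print(f"{year}\t{commits}")
```

The dataset documentation also describes hosted setups on which similar SQL can be run without downloading the full dataset, as well as smaller teasers that are convenient for prototyping queries like this one.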
How to Participate in the Challenge
First, familiarize yourself with the Software Heritage Graph Dataset:
- Read our MSR 2019 paper about the Software Heritage Graph Dataset [MSR19SH] and the preprint of our mining challenge proposal [MSR20DC], which contains example queries.
- Study the documentation of the dataset, which includes the most recent database layout, download information, as well as smaller dataset teasers that you can start with to whet your appetite.
- Join the public discussion mailing list to discuss with other challenge participants and chairs how to best exploit the dataset.
Then, use the dataset to answer your research questions, report your findings in a four-page data challenge paper (see information below) and submit your abstract and paper in time (see important dates below). If your paper is accepted, present your results at MSR 2020 in Seoul, South Korea!
Submission
A challenge paper should describe the results of your work by providing an introduction to the problem you address and why it is worth studying, the version of the dataset you used, the approach and tools you used, your results and their implications, and conclusions. Make sure your report highlights the contributions and the importance of your work. See also our open science policy regarding the publication of software and additional data you used for the challenge.
Challenge papers must not exceed 4 pages, plus 1 additional page containing only references, and must conform to the MSR 2020 format and submission guidelines. Each submission will be reviewed by at least three members of the program committee. Submissions should follow the ACM Conference Proceedings Formatting Guidelines (https://www.acm.org/publications/proceedings-template). LaTeX users must use the provided acmart.cls and ACM-Reference-Format.bst without modification, enable the conference format in the preamble of the document (i.e., \documentclass[sigconf,review]{acmart}), and use the ACM reference format for the bibliography (i.e., \bibliographystyle{ACM-Reference-Format}). The review option adds line numbers, thereby allowing referees to refer to specific lines in their comments.
IMPORTANT: MSR 2020 follows the double-blind submission model. Submissions should not reveal the identity of the authors in any way. This means that authors should:
- leave out author names and affiliations from the body and metadata of the submitted PDF
- ensure that any citations to related work by themselves are written in the third person, for example “the prior work of XYZ [2]” as opposed to “our prior work [2]”
- not refer to their personal, lab or university website; similarly, care should be taken with personal accounts on GitHub, Google Drive, etc.
- not upload unblinded versions of their paper on archival websites during bidding/reviewing. However, uploading unblinded versions prior to submission is allowed and sometimes unavoidable (e.g., a thesis).
Authors having further questions on double blind reviewing are encouraged to contact the Mining Challenge Chairs via email.
Papers must be submitted electronically through EasyChair, should not have been published elsewhere, and should not be under review or submitted for review elsewhere for the duration of consideration. ACM plagiarism policy and procedures shall be followed for cases of double submission. The submission must also comply with the IEEE Policy on Authorship.
Upon notification of acceptance, all authors of accepted papers will receive further instructions for preparing their camera ready versions. At least one author of each accepted paper is expected to register and present the results at MSR 2020 in Seoul, South Korea. All accepted contributions will be published in the electronic conference proceedings.
The dataset as object of study for the challenge can be cited through reference [MSR20DC] below, while the Software Heritage dataset itself and its schema can be referenced via [MSR19SH], which also contains additional sample queries.
@inproceedings{MSR20DC,
  author = {Antoine Pietri and Diomidis Spinellis and Stefano Zacchiroli},
  title = {The {Software Heritage Graph Dataset}: Large-scale Analysis of Public Software Development History},
  booktitle = {MSR 2020: The 17th International Conference on Mining Software Repositories},
  publisher = {IEEE},
  year = {2020},
  preprint = {https://upsilon.cc/~zack/research/publications/msr-2020-challenge.pdf}
}
@inproceedings{MSR19SH,
  author = {Antoine Pietri and Diomidis Spinellis and Stefano Zacchiroli},
  title = {The Software Heritage Graph Dataset: Public software development under one roof},
  booktitle = {MSR 2019: The 16th International Conference on Mining Software Repositories},
  publisher = {IEEE},
  year = {2019},
  doi = {10.1109/MSR.2019.00030},
  pages = {138--142},
  preprint = {https://upsilon.cc/~zack/research/publications/msr-2019-swh.pdf}
}
Important Dates
- Abstracts due: January 30, 2020 (AOE)
- Papers due: February 6, 2020 (AOE)
- Author notification: March 2, 2020 (AOE)
- Camera ready: March 16, 2020 (AOE)
Open Science Policy
Openness in science is key to fostering progress via transparency, reproducibility and replicability. Our steering principle is that all research output should be accessible to the public and that empirical studies should be reproducible. In particular, we actively support the adoption of open data and open source principles. To increase reproducibility and replicability, we encourage all contributing authors to disclose:
- the source code of the software they used to retrieve and analyze the data
- the (anonymized and curated) empirical data they retrieved in addition to the challenge dataset
- a document with instructions for other researchers describing how to reproduce or replicate the results
Authors can privately share their anonymized data and software on preservation archives such as Zenodo, Figshare (see instructions), and Software Heritage (see instructions) as early as submission time. After acceptance, data and software should be made public and referenceable. We also encourage authors to self-archive pre- and postprints of their papers in open, preserved repositories such as arXiv.org.
Best Mining Challenge Paper Award
All submissions will undergo the same review process independent of whether or not they disclose their analysis code or data. However, only accepted papers for which code and data are available on preservation archives, as described in the open science policy above, will be considered for the best mining challenge paper award.
Best Student Presentation Award
As in previous years, there will be a public vote during the conference to select the best mining challenge presentation. This award often goes to authors of compelling work who present an engaging story to the audience. To increase student involvement, only students are eligible for this award.
Organization
- Antoine Pietri, Inria, France
- Diomidis Spinellis, Athens University of Economics and Business, Greece
- Stefano Zacchiroli, University Paris Diderot and Inria, France
Resources for Participants
- Dataset documentation
- Papers: the dataset paper [MSR19SH] and the challenge proposal [MSR20DC] (see references above)
- Public discussion mailing list, among challenge participants and with the chairs
- Software Heritage public IRC channel for development discussions: #swh-devel @ irc.freenode.net