MSR 2020
Mon 29 - Tue 30 June 2020
co-located with ICSE 2020
Tue 30 Jun 2020 14:24 - 14:36 at MSR:Zoom - ML4SE Chair(s): Kevin Moran

Programming screencasts can be a rich source of documentation for developers, as they often contain step-by-step explanations of source code, programming concepts, errors, etc. Despite the availability of millions of such videos, the information they contain, especially the source code displayed on screen, is not easy for programmers to find, search, or reuse. Recent research has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or to other processing tools. A crucial component in this line of work is the choice of Optical Character Recognition (OCR) engine used to identify and transcribe the source code shown on screen. Previous work has simply chosen one OCR engine or another, without considering the accuracy of these engines on source code recognition. In this paper, we aim to address this oversight and present an empirical study on the accuracy of six OCR engines for extracting source code from programming screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely chosen in previous studies is far from the best choice. We also show how other factors, such as font type and size, can impact the results. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable better OCR of their source code, as well as advice on OCR choice for researchers aiming to analyze source code in screencasts.
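
The abstract does not state how transcription accuracy is measured. As an illustration only, one common way to score an OCR transcription against known source code is a normalized character-level edit distance. The sketch below assumes that metric; the function names `levenshtein` and `transcription_accuracy` are hypothetical and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def transcription_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: 1 minus edit distance normalized by the longer string."""
    longest = max(len(ground_truth), len(ocr_output))
    if longest == 0:
        return 1.0  # two empty strings match perfectly
    return 1.0 - levenshtein(ground_truth, ocr_output) / longest


if __name__ == "__main__":
    # A typical OCR confusion in code: the digit 1 read as the letter l.
    print(transcription_accuracy("int x = 1;", "int x = l;"))
```

A character-level score is used here because a single misread character (for example `1` versus `l`, or a dropped brace) can change the meaning of code, which a word-level score would weight differently.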

Tue 30 Jun

Displayed time zone: (UTC) Coordinated Universal Time

14:00 - 15:00
ML4SE
Technical Papers / Registered Reports / Keynote / MSR Awards / FOSS Award / Education / Data Showcase / Mining Challenge / MSR Challenge Proposals / Ask Me Anything at MSR:Zoom
Chair(s): Kevin Moran (William & Mary/George Mason University)

Q/A & Discussion of Session Papers over Zoom (Joining info available on Slack)

14:00
12m
Live Q&A
A Machine Learning Approach for Vulnerability Curation
ACM SIGSOFT Distinguished Paper Award
MSR - Technical Paper
Technical Papers
Chen Yang (Veracode, Inc.), Andrew Santosa (Veracode, Inc.), Ang Ming Yi, Abhishek Sharma (Singapore Management University, Singapore), Asankhaya Sharma (Veracode, Inc.), David Lo (Singapore Management University)
14:12
12m
Live Q&A
Embedding Java Classes with code2vec: Improvements from Variable Obfuscation
MSR - Technical Paper
Technical Papers
Rhys Compton (University of Waikato), Eibe Frank (Department of Computer Science, University of Waikato), Panos Patros, Abigail Koay (University of Waikato)
14:24
12m
Live Q&A
A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts
MSR - Technical Paper
Technical Papers
Abdulkarim Malkadi (Florida State University, USA - Jazan University, KSA), Mohammad Alahmadi (Florida State University), Sonia Haiduc (Florida State University)
14:36
12m
Live Q&A
What is the Vocabulary of Flaky Tests?
MSR - Technical Paper
Technical Papers
Gustavo Pinto (UFPA), Breno Miranda (Federal University of Pernambuco), Supun Dissanayake (The University of Adelaide), Marcelo d'Amorim (Federal University of Pernambuco), Christoph Treude (The University of Adelaide), Antonia Bertolino (CNR-ISTI)
14:48
12m
Live Q&A
Improved Automatic Summarization of Subroutines via Attention to File Context
MSR - Technical Paper
Technical Papers
Sakib Haque (University of Notre Dame), Alexander LeClair (University of Notre Dame), Lingfei Wu (IBM Research), Collin McMillan (University of Notre Dame)