MSR 2020
Mon 29 - Tue 30 June 2020
co-located with ICSE 2020
Tue 30 Jun 2020 14:24 - 14:36 at MSR:Zoom - ML4SE Chair(s): Kevin Moran

Programming screencasts can be a rich source of documentation for developers, as they often contain step-by-step explanations of source code, programming concepts, errors, etc. Despite the availability of millions of such videos, the rich information they contain, and especially the source code displayed on screen, is not easy for programmers to find, search, or reuse. Recent research has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or to other processing tools. A crucial component in this line of work is the choice of Optical Character Recognition (OCR) engine used to identify and transcribe the source code shown on screen. Previous work has simply chosen one OCR engine or another, without considering the accuracy of these engines on source code recognition. In this paper, we aim to address this oversight and present an empirical study of the accuracy of six OCR engines on the extraction of source code from programming screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely chosen in previous studies is far from the best choice. We also show how other factors, such as font type and size, can impact the results. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable better OCR of their source code, as well as advice on OCR choice for researchers aiming to analyze source code in screencasts.
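The abstract does not say how transcription accuracy is measured. As a minimal sketch, assuming a character-level metric based on normalized edit distance (an assumption for illustration, not necessarily the metric used in the paper), the OCR output for a code snippet could be scored against the ground-truth source as follows. The function names `levenshtein` and `transcription_accuracy` and the sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def transcription_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Assumed metric: 1 - edit distance / length of the longer string."""
    longest = max(len(ground_truth), len(ocr_output))
    if longest == 0:
        return 1.0  # two empty strings match trivially
    return 1.0 - levenshtein(ground_truth, ocr_output) / longest


# Hypothetical example: the OCR engine misreads the digit 0 as the letter O.
ground_truth = "for (int i = 0; i < n; i++)"
ocr_output = "for (int i = O; i < n; i++)"
score = transcription_accuracy(ground_truth, ocr_output)
print(f"accuracy: {score:.3f}")
```

A score like this is sensitive to exactly the confusions that matter for code (for example `0` versus `O`, or a dropped semicolon), which is one reason a character-level comparison is a plausible choice for comparing engines on the same frames.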

Tue 30 Jun
Times are displayed in time zone: (UTC) Coordinated Universal Time

14:00 - 15:00: Technical Papers - ML4SE at MSR:Zoom
Chair(s): Kevin Moran (George Mason University)

Q/A & Discussion of Session Papers over Zoom (Joining info available on Slack)

14:00 - 14:12
Live Q&A
Chen Yang (Veracode, Inc.), Andrew Santosa (Veracode, Inc.), Ang Ming Yi, Abhishek Sharma (Singapore Management University, Singapore), Asankhaya Sharma (Veracode, Inc.), David Lo (Singapore Management University)
14:12 - 14:24
Live Q&A
Rhys Compton (University of Waikato), Eibe Frank (Department of Computer Science, University of Waikato), Panos Patros, Abigail Koay (University of Waikato)
14:24 - 14:36
Live Q&A
Abdulkarim Khormi (Florida State University, USA - Jazan University, KSA), Mohammad Alahmadi (Florida State University), Sonia Haiduc (Florida State University)
14:36 - 14:48
Live Q&A
Gustavo Pinto (UFPA), Breno Miranda (Federal University of Pernambuco), Supun Dissanayake (The University of Adelaide), Marcelo d'Amorim (Federal University of Pernambuco), Christoph Treude (The University of Adelaide), Antonia Bertolino (CNR-ISTI)
14:48 - 15:00
Live Q&A
Sakib Haque (University of Notre Dame), Alexander LeClair (University of Notre Dame), Lingfei Wu (IBM Research), Collin McMillan (University of Notre Dame)