A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts (MSR 2020 - Technical Papers)

Who

Abdulkarim Malkadi, Mohammad Alahmadi, Sonia Haiduc

Track

MSR 2020 Technical Papers

When

Tue 30 Jun 2020 14:24 - 14:36 (UTC) at MSR:Zoom - ML4SE. Chair(s): Kevin Moran

Abstract

Programming screencasts can be a rich source of documentation for developers, as they often contain step-by-step explanations of source code, programming concepts, errors, etc. Despite the availability of millions of such videos, the rich information they contain, and especially the source code displayed on screen, is not easy for programmers to find, search, or reuse. Recent research has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or to other processing tools. A crucial component in this line of work is the choice of Optical Character Recognition (OCR) engine used to identify and transcribe the source code shown on screen. Previous work has simply chosen one OCR engine or another, without considering how accurately these engines recognize source code. In this paper, we address this oversight and present an empirical study on the accuracy of six OCR engines in extracting source code from programming screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely chosen in previous studies is far from the best choice. We also show how other factors, such as font type and size, can impact the results. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable better OCR of their source code, as well as advice on OCR choice for researchers aiming to analyze source code in screencasts.
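
To make the notion of transcription accuracy concrete, here is a minimal sketch of one common way to score OCR output against a ground-truth code snippet: a character-level similarity derived from the Levenshtein edit distance. This is an assumption for illustration only; the abstract does not state which metric the study uses, and the function names `edit_distance` and `transcription_accuracy` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def transcription_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Similarity in [0, 1]; 1.0 means the OCR output matches exactly.
    Normalizing by the longer string is an assumed convention."""
    longest = max(len(ground_truth), len(ocr_output))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    return 1.0 - edit_distance(ground_truth, ocr_output) / longest


if __name__ == "__main__":
    truth = "for (int i = 0;"
    ocr = "for (int i = O;"  # a typical OCR confusion: digit 0 read as letter O
    print(round(transcription_accuracy(truth, ocr), 3))
```

A character-level score like this penalizes the small confusions (such as `0` versus `O`, or a dropped bracket) that matter a great deal for source code, which is one reason such a metric might be preferred over word-level comparison.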

Link to Preprint

https://www.researchgate.net/publication/342436852_A_Study_on_the_Accuracy_of_OCR_Engines_for_Source_Code_Transcription_from_Programming_Screencasts

Abdulkarim Malkadi

Florida State University, USA - Jazan University, KSA

Mohammad Alahmadi

Florida State University

Sonia Haiduc

Florida State University, United States
