A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts
MSR - Technical Paper
Programming screencasts can be a rich source of documentation for developers, as they often contain step-by-step explanations of source code, programming concepts, errors, etc. Despite the availability of millions of such videos, the information they contain, and especially the source code displayed on screen, is not easy for programmers to find, search, or reuse. Recent research has identified this challenge and proposed solutions that identify and extract source code from video tutorials in order to make it readily available to developers or to other processing tools. A crucial component in this line of work is the choice of Optical Character Recognition (OCR) engine used to identify and transcribe the source code shown on screen. Previous work has simply chosen one OCR engine or another, without considering the accuracy of these engines on source code recognition. In this paper, we aim to address this oversight and present an empirical study of the accuracy of six OCR engines on the extraction of source code from programming screencasts and code images. Our results show that transcription accuracy varies greatly from one OCR engine to another and that the OCR engine most widely chosen in previous studies is far from the best choice. We also show how other factors, such as font type and size, can impact the results. We conclude by offering guidelines for programming screencast creators on which fonts to use to enable better OCR of their source code, as well as advice on OCR engine choice for researchers aiming to analyze source code in screencasts.