Polyglot and Distributed Software Repository Mining with CROSSFLOW
MSR - Technical Paper
Repository mining of large-scale software systems often requires substantial storage and computational resources, and commonly involves a large number of calls to rate-limited APIs, such as those exposed by GitHub and Stack Overflow. This creates a growing need for repository mining programs to be executed in a distributed manner, so that remote collaborators can contribute computational and storage resources, as well as API quotas, without having to share API access tokens or credentials. In this paper we present CROSSFLOW, a novel framework for building polyglot distributed repository mining programs. We demonstrate how CROSSFLOW delegates mining jobs to remote workers and caches their results, how such workers can implement advanced behaviors such as load balancing and rejecting jobs that they either cannot perform or would execute sub-optimally, and how workers of the same analysis program can be written in different programming languages, such as Java and Python, with each worker executing only the parts of the program that are described in its language.