Visions & Reflections

In the last few years, artificial intelligence (AI) and machine learning (ML) have become ubiquitous terms. These powerful techniques have escaped academic obscurity thanks to the recent onslaught of AI & ML tools, frameworks, and libraries that make them accessible to a wider audience of developers. As a result, applying AI & ML to solve existing and emergent problems is an increasingly popular practice. However, little is known about this domain from the software engineering perspective. Many AI & ML tools and applications are open source and hosted on platforms such as GitHub, which provide rich tools for large-scale distributed software development. Despite their widespread use and popularity, these repositories have never been examined as a community to identify unique properties, development patterns, and trends.

In this paper, we conduct a large-scale empirical study of AI & ML Tool (700) and Application (4,524) repositories hosted on GitHub to develop such a characterization. To situate this community within the wider population of repositories, we compare our analyses against 4,101 unrelated repositories. We enhance this characterization with an elaborate study of developer workflow that measures collaboration and autonomy within a repository. We capture key insights from this community's 10-year history, such as its primary language (Python) and most popular repositories (TensorFlow, Tesseract). Our findings show that the AI & ML community has unique characteristics that should be accounted for in future research.

Danielle Gonzalez (Rochester Institute of Technology, USA), Thomas Zimmermann (Microsoft Research), Nachiappan Nagappan (Microsoft Research)
Nicolas Gold (University College London), Jens Krinke (University College London)
Ang Jia (Xi'an Jiaotong University), Ming Fan (Xi'an Jiaotong University), Xi Xu, Di Cui (Xi'an Jiaotong University), Wenying Wei, Zijiang Yang (Western Michigan University), Kai Ye, Ting Liu (Xi'an Jiaotong University)
