PUMiner: Mining Security Posts from Developer Question and Answer Websites with PU Learning (MSR 2020 - Technical Papers)

Who

Triet Le, David Hin, Roland Croft, Muhammad Ali Babar

Track

MSR 2020 Technical Papers

Time Zone

Times are shown in Coordinated Universal Time (UTC).

When

Tue 30 Jun 2020 11:24 - 11:36 at MSR:Zoom2, in the Security session. Chair(s): Dimitris Mitropoulos

Abstract

Security is an increasing concern in software development. Developer Question and Answer (Q&A) websites host a large amount of security discussion. Existing studies have used human-defined rules to mine security discussions, but these approaches still miss many posts, which may lead to an incomplete analysis of the security practices reported on Q&A websites. Traditional supervised Machine Learning methods can automate the mining process; however, the negative (non-security) class they require is too expensive to obtain. We propose a novel learning framework, PUMiner, to automatically mine security posts from Q&A websites. PUMiner builds a context-aware embedding model to extract features of the posts, and then develops a two-stage PU model to identify security content using the labelled Positive and Unlabelled posts. We evaluate PUMiner on more than 17.2 million posts on Stack Overflow and 52,611 posts on Security StackExchange. We show that PUMiner is effective, with a validation performance of at least 85% across all model configurations. Moreover, on unseen testing posts, the Matthews Correlation Coefficient (MCC) of PUMiner is 0.906 points, 148%, and 10.4% higher than that of one-class SVM, positive-similarity filtering, and one-stage PU models, respectively. PUMiner also performs well, with an MCC of 0.745, in scenarios where string matching totally fails. Even when the ratio of labelled positive posts to unlabelled ones is only 1:100, PUMiner still achieves a strong MCC of 0.65, which is 160% better than fully-supervised learning. Using PUMiner, we provide the largest and most up-to-date collection of security content on Q&A websites for practitioners and researchers.
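To make the terms in the abstract concrete, the following is a minimal Python sketch of a generic two-stage PU (Positive-Unlabelled) classifier, evaluated with MCC on toy data. It is not the authors' implementation: PUMiner uses a context-aware embedding model and its own two-stage PU model, whereas this sketch assumes tiny hand-made 2-D feature vectors, a distance-to-centroid heuristic for picking "reliable negatives", and a nearest-centroid classifier. All names (`two_stage_pu`, `negative_fraction`, the toy vectors) are hypothetical and chosen only for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def two_stage_pu(positives, unlabelled, negative_fraction=0.5):
    """Train on labelled positives and unlabelled data only.

    Returns a function predict(vector) -> bool.
    """
    pos_centre = centroid(positives)
    # Stage 1 (assumed heuristic): the unlabelled points farthest from the
    # positive centroid are treated as "reliable negatives".
    ranked = sorted(unlabelled, key=lambda v: distance(v, pos_centre), reverse=True)
    k = max(1, int(len(ranked) * negative_fraction))
    neg_centre = centroid(ranked[:k])

    # Stage 2: nearest-centroid classifier, positives vs reliable negatives.
    def predict(vector):
        return distance(vector, pos_centre) < distance(vector, neg_centre)

    return predict


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for boolean labels (range -1 to 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: made-up 2-D "embeddings"; real posts would need a text encoder.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabelled = [[1.1, 1.0], [0.95, 1.05], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1], [5.1, 5.0]]
# Hidden ground truth for the unlabelled posts, used only for evaluation.
truth = [True, True, False, False, False, False]

predict = two_stage_pu(positives, unlabelled)
predictions = [predict(v) for v in unlabelled]
score = mcc(truth, predictions)
print(predictions, score)
```

The key property of the PU setting shows up in `two_stage_pu`: it never receives a labelled negative example. The ground-truth labels for the unlabelled set are consulted only when computing the MCC, which is the metric the abstract reports.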

Link to Preprint

https://arxiv.org/abs/2003.03741

DOI

https://doi.org/10.1145/3379597.3387443

Triet Le

The University of Adelaide

Australia

David Hin

Roland Croft

Muhammad Ali Babar

The University of Adelaide
