A Machine Learning Approach for Vulnerability Curation (MSR 2020 - Technical Papers)

Who

Chen Yang, Andrew Santosa, Ang Ming Yi, Abhishek Sharma, Asankhaya Sharma, David Lo

Track

MSR 2020 Technical Papers

When

Tue 30 Jun 2020 14:00 - 14:12 at MSR:Zoom - ML4SE Chair(s): Kevin Moran

Abstract

Central to software composition analysis is a database of vulnerabilities of open-source libraries. Security researchers curate this database from various data sources, including bug tracking systems, commits, and mailing lists. In this article, we report the design and implementation of a machine learning system to help the curation by automatically predicting the vulnerability-relatedness of each data item. It supports a complete pipeline from data collection, model training and prediction, to the validation of new models before deployment. It is executed iteratively to generate better models as new input data become available. It is enhanced by self-training to significantly and automatically increase the size of the training dataset, opportunistically maximizing the improvement in the models’ quality at each iteration. We devised a new “deployment stability” metric to evaluate the quality of the new models before deployment into production. We experimentally evaluate the improvement in the performance of the models in one iteration, with a maximum PR AUC improvement of 27.59%. Ours is the first such study across a variety of data sources. We discover that adding the features of the corresponding commits to the features of issues/pull requests improves the precision for the recall values that matter. We demonstrate the effectiveness of self-training alone, with a 10.50% PR AUC improvement, and we discover that there is no uniform ordering of word2vec parameter sensitivity across data sources. We show how the deployment stability metric helped to discover an error.

Link to Preprint

http://asankhaya.github.io/pdf/A-Machine-Learning-Approach-for-Vulnerability-Curation.pdf

Chen Yang

Veracode, Inc.

Andrew Santosa

Veracode, Inc.

Ang Ming Yi

Abhishek Sharma

Singapore Management University, Singapore

Asankhaya Sharma

Veracode, Inc.

Singapore

David Lo

Singapore Management University

Singapore

Session Program

Tue 30 Jun
Displayed time zone: (UTC) Coordinated Universal Time

14:00 - 15:00	ML4SE (Technical Papers / Registered Reports / Keynote / MSR Awards / FOSS Award / Education / Data Showcase / Mining Challenge / MSR Challenge Proposals / Ask Me Anything) at MSR:Zoom. Chair(s): Kevin Moran, William & Mary/George Mason University. Q/A & Discussion of Session Papers over Zoom (Joining info available on Slack)

14:00 12m Live Q&A		A Machine Learning Approach for Vulnerability Curation (ACM SIGSOFT Distinguished Paper Award). MSR - Technical Paper, Technical Papers. Chen Yang (Veracode, Inc.), Andrew Santosa (Veracode, Inc.), Ang Ming Yi, Abhishek Sharma (Singapore Management University, Singapore), Asankhaya Sharma (Veracode, Inc.), David Lo (Singapore Management University)
14:12 12m Live Q&A		Embedding Java Classes with code2vec: Improvements from Variable Obfuscation. MSR - Technical Paper, Technical Papers. Rhys Compton (University of Waikato), Eibe Frank (Department of Computer Science, University of Waikato), Panos Patros, Abigail Koay (University of Waikato)
14:24 12m Live Q&A		A Study on the Accuracy of OCR Engines for Source Code Transcription from Programming Screencasts. MSR - Technical Paper, Technical Papers. Abdulkarim Malkadi (Florida State University, USA - Jazan University, KSA), Mohammad Alahmadi (Florida State University), Sonia Haiduc (Florida State University)
14:36 12m Live Q&A		What is the Vocabulary of Flaky Tests? MSR - Technical Paper, Technical Papers. Gustavo Pinto (UFPA), Breno Miranda (Federal University of Pernambuco), Supun Dissanayake (The University of Adelaide), Marcelo d'Amorim (Federal University of Pernambuco), Christoph Treude (The University of Adelaide), Antonia Bertolino (CNR-ISTI)
14:48 12m Live Q&A		Improved Automatic Summarization of Subroutines via Attention to File Context. MSR - Technical Paper, Technical Papers. Sakib Haque (University of Notre Dame), Alexander LeClair (University of Notre Dame), Lingfei Wu (IBM Research), Collin McMillan (University of Notre Dame)