D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis
Zheng, Yunhui, Pujar, Saurabh, Lewis, Burn, Buratti, Luca, Epstein, Edward, Yang, Bo, Laredo, Jim, Morari, Alessandro, Su, Zhong
Static analysis tools are widely used for vulnerability detection because they can analyze programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of machine learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets for training vulnerability-identification models suffer from multiple limitations, such as limited bug context, limited size, and synthetic, unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open-source projects. From each project, we select bug-fixing commits and run static analysis on the versions before and after each commit. If an issue detected in a before-commit version disappears in the corresponding after-commit version, it is very likely a real bug that was fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier that identifies likely false alarms among the issues reported by static analysis, helping developers prioritize and investigate potential true positives first.
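The differential-labeling heuristic described in the abstract can be sketched in a few lines. This is an illustrative simplification, not the paper's pipeline: the `Issue` fields below are hypothetical, and the real D2A implementation matches static-analyzer reports across versions using richer information than shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    """Hypothetical, simplified representation of a static-analysis report."""
    bug_type: str   # e.g. "BUFFER_OVERRUN"
    function: str   # function in which the issue is reported

def label_issues(before, after):
    """Label issues found in the pre-commit version.

    An issue that disappears after a bug-fixing commit is labeled as a
    likely true positive (True); one that persists is a likely false
    alarm (False).
    """
    after_set = set(after)
    return {issue: issue not in after_set for issue in before}

# Toy example: one issue vanishes after the fix, one persists.
before = [Issue("BUFFER_OVERRUN", "parse_header"),
          Issue("NULL_DEREFERENCE", "init_state")]
after = [Issue("NULL_DEREFERENCE", "init_state")]

labels = label_issues(before, after)
```

Here the `BUFFER_OVERRUN` report disappears in the after-commit version, so it is labeled a likely real bug, while the persisting `NULL_DEREFERENCE` report is labeled a likely false alarm.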
Analyzing and Predicting Not-Answered Questions in Community-based Question Answering Services
Yang, Lichun (Shanghai Jiao Tong University) | Bao, Shenghua (IBM Research China) | Lin, Qingliang (Shanghai Jiao Tong University) | Wu, Xian (IBM Research China) | Han, Dingyi (Shanghai Jiao Tong University) | Su, Zhong (IBM Research China) | Yu, Yong (Shanghai Jiao Tong University)
This paper focuses on analyzing and predicting not-answered questions in Community-based Question Answering (CQA) services, such as Yahoo! Answers. In CQA services, users express their information needs by submitting natural language questions and await answers from other human users. Compared to receiving results from web search engines via keyword queries, CQA users are likely to get more specific answers, because human answerers can catch the main point of a question. However, one key problem of this pattern is that sometimes no one gives an answer, whereas web search engines rarely fail to respond. In this paper, we analyze not-answered questions and make a first attempt at predicting whether a question will receive answers. More specifically, we first analyze questions from Yahoo! Answers based on features selected from different perspectives. Then, we formalize the prediction problem as a supervised binary classification problem and leverage the proposed features to make predictions. Extensive experiments are conducted on 76,251 questions collected from Yahoo! Answers. We analyze the specific characteristics of not-answered questions and suggest possible reasons why a question is unlikely to be answered. As for prediction, the experimental results show that classification based on the proposed features significantly outperforms a simple word-based approach.
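The prediction setup described above — hand-crafted question features fed into a supervised binary classifier — can be sketched as follows. The feature set and model here are assumptions for illustration only (the abstract does not specify either); this is a minimal hand-rolled logistic regression, not the paper's method.

```python
import math

def features(question):
    """Hypothetical question features; the paper's actual feature set
    spans several perspectives not modeled here."""
    text = question["title"]
    return [
        1.0,                                        # bias term
        len(text.split()) / 20.0,                   # normalized length
        1.0 if "?" in text else 0.0,                # contains a question mark
        1.0 if question["category"] == "Programming" else 0.0,
    ]

def train(samples, labels, epochs=200, lr=0.5):
    """Fit a tiny logistic-regression classifier by gradient descent."""
    w = [0.0] * len(features(samples[0]))
    for _ in range(epochs):
        for q, y in zip(samples, labels):
            x = features(q)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for i in range(len(w)):
                w[i] += lr * (y - p) * x[i]
    return w

def predict(w, question):
    """Return 1 if the question is predicted to receive an answer."""
    z = sum(wi * xi for wi, xi in zip(w, features(question)))
    return 1 if z > 0 else 0

# Toy training data: a well-formed question vs. a bare fragment.
questions = [
    {"title": "How do I sort a list in Python?", "category": "Programming"},
    {"title": "thoughts", "category": "Other"},
]
labels = [1, 0]
w = train(questions, labels)
```

The two toy examples are linearly separable in this feature space, so the trained weights recover the labels; the interesting work in the paper lies in choosing features that generalize to real CQA data.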