The problem with searching for code is that the query, e.g. "read an object from xml," doesn't look very much like the source code snippets that are the intended results, e.g.: That's why we have Stack Overflow! Stack Overflow can help with'how to' style queries, but it can't help with searches inside codebases you care about. For example, "where in this codebase are events queued on a thread?" DeepCS is just such a search engine for code, based on the CODEnn (Code-Description Embedding Neural Network) network model.
A new benchmark to evaluate code search techniques. The benchmark includes the largest evaluation data set currently available for Java, consisting of a natural language query and code snippet pairs. This data set comprises 287 Stack Overflow question-and-answer pairs from the Stack Exchange Data Dump. Also included is a search corpus that contains more than 24,000 of the most popular Android repositories on GitHub (ranked by the number of stars) and is indexed using the more than 4.7 million method bodies parsed from these repositories. A score sheet on the evaluation data set, using two models from our recent work, is also included.
Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from Github, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and it simply assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this paper, we propose a classifier to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and the code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 77.7%. These results show that deploying Machine Learning techniques on the combination of text and the code snippets of a question provides the best performance. These results demonstrate also that it is possible to identify the programming language of a snippet of few lines of source code. We visualize the feature space of two programming languages Java and SQL in order to identify some special properties of information inside the questions in Stack Overflow corresponding to these languages.