Maniatis, Petros
AI-Assisted Assessment of Coding Practices in Modern Code Review
Vijayvergiya, Manushree, Salawa, Małgorzata, Budiselić, Ivan, Zheng, Dan, Lamblin, Pascal, Ivanković, Marko, Carin, Juanjo, Lewko, Mateusz, Andonov, Jovan, Petrović, Goran, Tarlow, Daniel, Maniatis, Petros, Just, René
Modern code review is a process in which an incremental code contribution made by a code author is reviewed by one or more peers before it is committed to the version control system. An important element of modern code review is verifying that code contributions adhere to best practices. While some of these best practices can be automatically verified, verifying others is commonly left to human reviewers. This paper reports on the development, deployment, and evaluation of AutoCommenter, a system backed by a large language model that automatically learns and enforces coding best practices. We implemented AutoCommenter for four programming languages (C++, Java, Python, and Go) and evaluated its performance and adoption in a large industrial setting. Our evaluation shows that an end-to-end system for learning and enforcing coding best practices is feasible and has a positive impact on the developer workflow. Additionally, this paper reports on the challenges associated with deploying such a system to tens of thousands of developers and the corresponding lessons learned.
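The abstract leaves the system's internals unspecified, but the core loop of an LLM-backed reviewer can be pictured with a minimal Python sketch. Everything here is an assumption for illustration, not the deployed AutoCommenter design: call_llm stands in for a hypothetical model endpoint, and the JSON comment format is invented.

# Minimal sketch of an LLM-backed best-practice commenter for code review.
# `call_llm` is a hypothetical model endpoint; the prompt and the JSON output
# format are illustrative assumptions, not the deployed AutoCommenter design.
import json
from typing import Callable

def review_diff(diff: str, best_practices: list[str],
                call_llm: Callable[[str], str]) -> list[dict]:
    """Ask the model whether a code diff violates any known best practice."""
    prompt = (
        "You are a code reviewer. For each best-practice violation in the "
        "diff below, emit a JSON object with keys 'line', 'practice', and "
        "'comment'. Return a JSON list (possibly empty).\n\n"
        "Best practices:\n" + "\n".join(f"- {p}" for p in best_practices) +
        "\n\nDiff:\n" + diff
    )
    raw = call_llm(prompt)
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError:
        return []  # Malformed model output: post nothing rather than noise.
    # Keep only well-formed comments, since low-precision comments erode
    # developer trust in the tool.
    return [c for c in comments
            if isinstance(c, dict) and {"line", "practice", "comment"} <= c.keys()]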
CodeQueries: A Dataset of Semantic Queries over Code
Sahu, Surya Prakash, Mandal, Madhurima, Bharadwaj, Shikhar, Kanade, Aditya, Maniatis, Petros, Shevade, Shirish
Developers often have questions about semantic aspects of code they are working on, e.g., "Is there a class whose parent classes declare a conflicting attribute?". Answering them requires understanding code semantics such as attributes and inheritance relations of classes. An answer to such a question should identify the code spans constituting the answer (e.g., the declaration of the subclass) as well as supporting facts (e.g., the definitions of the conflicting attributes). Existing work on question-answering over code has considered yes/no questions or method-level context. We contribute a labeled dataset, called CodeQueries, of semantic queries over Python code. Compared to existing datasets, the queries in CodeQueries are about code semantics, the context is file-level, and the answers are code spans. We curate the dataset based on queries supported by a widely used static analysis tool, CodeQL, and include both positive and negative examples, as well as queries requiring single-hop and multi-hop reasoning. To assess the value of our dataset, we evaluate baseline neural approaches. We study a large language model (GPT-3.5-Turbo) in zero-shot and few-shot settings on a subset of CodeQueries. We also evaluate a BERT-style model (CuBERT) with fine-tuning. We find that these models achieve limited success on CodeQueries. CodeQueries is thus a challenging dataset for testing the ability of neural models to understand code semantics in the extractive question-answering setting.
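The abstract's running example can be made concrete with a short file-level context. In the hypothetical snippet below, the answer span for the query "Is there a class whose parent classes declare a conflicting attribute?" is the declaration of the subclass, and the supporting facts are the two conflicting attribute definitions; the exact span boundaries CodeQueries labels are an assumption here.

# Query: "Is there a class whose parent classes declare a conflicting attribute?"
class Reader:
    mode = "r"        # supporting fact: attribute declared in one parent

class Writer:
    mode = "w"        # supporting fact: conflicting attribute in another parent

class ReadWriter(Reader, Writer):  # answer span: the subclass declaration
    pass

An extractive model answering this query must return the ReadWriter declaration as the answer span and the two mode definitions as supporting facts, which requires multi-hop reasoning across the inheritance relation.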
Neural Program Synthesis with a Differentiable Fixer
Balog, Matej, Singh, Rishabh, Maniatis, Petros, Sutton, Charles
We present a new program synthesis approach that combines an encoder-decoder based synthesis architecture with a differentiable program fixer. Our approach is inspired by the fact that human developers seldom get their programs correct on the first attempt, and instead perform iterative, testing-based fixing to arrive at the desired functionality. Similarly, our approach first learns a distribution over programs conditioned on an encoding of a set of input-output examples, and then iteratively performs fix operations using the differentiable fixer. The fixer takes as input the original examples and the current program's outputs on the example inputs, and generates a new distribution over programs with the goal of reducing the discrepancies between the current program outputs and the desired example outputs. We train our architecture end-to-end on the RobustFill domain, and show that the addition of the fixer module leads to a significant improvement in synthesis accuracy compared to using beam search.
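The iterative loop the abstract describes can be sketched as follows; synthesizer, fixer, and execute are hypothetical placeholders for the learned components and the program interpreter, and the distribution interface is an assumption made for readability.

# Sketch of the synthesize-then-fix loop. `synthesizer`, `fixer`, and
# `execute` are hypothetical placeholders, not the paper's actual interfaces.
def synthesize_with_fixer(examples, synthesizer, fixer, execute, fix_steps=5):
    """examples: list of (input, expected_output) pairs."""
    inputs = [x for x, _ in examples]
    targets = [y for _, y in examples]
    # Initial distribution over programs, conditioned on the I/O examples.
    program_dist = synthesizer(examples)
    for _ in range(fix_steps):
        program = program_dist.most_likely()
        outputs = [execute(program, x) for x in inputs]
        if outputs == targets:  # All examples satisfied: stop early.
            return program
        # The fixer sees the examples and the current program's outputs and
        # produces a new distribution that shrinks the output discrepancies.
        program_dist = fixer(examples, outputs, program_dist)
    return program_dist.most_likely()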
Neural Program Repair by Jointly Learning to Localize and Repair
Vasic, Marko, Kanade, Aditya, Maniatis, Petros, Bieber, David, Singh, Rishabh
Due to its potential to improve programmer productivity and software quality, automated program repair has been an active topic of research. Newer techniques harness neural networks to learn directly from examples of buggy programs and their fixes. In this work, we consider a recently identified class of bugs called variable-misuse bugs. We show that it is beneficial to train a model that jointly and directly localizes and repairs variable-misuse bugs. The experimental results show that the joint model significantly outperforms an enumerative solution that uses a pointer-based model for repair alone.

Advances in machine learning and the availability of large corpora of source code have led to growing interest in the development of neural representations of programs for performing program analyses. In recent work, Allamanis et al. (2018) proposed the problem of variable misuse (VARMISUSE): given a program, find program locations where variables are used, and predict the correct variables that should be in those locations. A VARMISUSE bug exists when the correct variable differs from the current one at a location. Allamanis et al. (2018) show that variable misuses occur in practice, e.g., when a programmer copies some code into a new context but forgets to rename a variable from the older context, or when two variable names within the same scope are easily confused.
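One way to picture the joint model is as two pointer heads over a shared encoding of the program's tokens: one distribution points at the buggy location, the other at the token whose variable should replace it. The PyTorch sketch below is an illustrative assumption about shapes and layers, not the paper's exact architecture.

# Joint localization + repair as two pointer heads over token encodings.
# Layer sizes and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class JointLocalizeRepair(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.loc_head = nn.Linear(hidden_dim, 1)     # scores each token as the bug
        self.repair_head = nn.Linear(hidden_dim, 1)  # scores each token as the fix

    def forward(self, token_encodings: torch.Tensor):
        # token_encodings: (batch, seq_len, hidden_dim), e.g. from an RNN
        # or Transformer encoder run over the program's tokens.
        loc_logits = self.loc_head(token_encodings).squeeze(-1)
        repair_logits = self.repair_head(token_encodings).squeeze(-1)
        # Two pointer distributions over the same token sequence: where the
        # misused variable is, and which occurrence of an in-scope variable
        # should replace it. Training both jointly is the paper's key idea.
        return loc_logits.softmax(-1), repair_logits.softmax(-1)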
Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression
Huang, Ling, Jia, Jinzhu, Yu, Bin, Chun, Byung-gon, Maniatis, Petros, Naik, Mayur
Predicting the execution time of computer programs is an important but challenging problem in the computer systems community. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, from which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms build relationships between responses (e.g., the execution time of a computer program) and features, and select a few of the hundreds of extracted features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model can reveal important insights into the computer program (e.g., the features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program's behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate predictions with relative error less than 7% using a moderate number of training samples. In addition, we compare the SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
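In spirit, sparse polynomial regression expands the raw features into polynomial terms and fits an L1-regularized linear model so that only a few terms survive. The sketch below uses scikit-learn as a stand-in for the paper's own two algorithms; the synthetic data and the choice of Lasso are assumptions for illustration.

# Sparse polynomial regression in the spirit of SPORE: expand execution
# features into degree-2 polynomial terms, then fit an L1-regularized model.
# scikit-learn's Lasso is a stand-in for the paper's own algorithms.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(1, 100, size=(200, 3))       # e.g. input size, loop counts
y = 0.5 * X[:, 0] * X[:, 1] + 2.0 * X[:, 2]  # hidden non-linear cost model
y += rng.normal(scale=0.1, size=200)         # measurement noise

model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      Lasso(alpha=1.0))
model.fit(X, y)

# The surviving non-zero coefficients name the feature combinations that
# dominate execution time, which is what makes the model interpretable.
coefs = model.named_steps["lasso"].coef_
names = model.named_steps["polynomialfeatures"].get_feature_names_out()
print([(n, round(c, 3)) for n, c in zip(names, coefs) if abs(c) > 1e-3])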
Mantis: Predicting System Performance through Program Analysis and Modeling
Chun, Byung-Gon, Huang, Ling, Lee, Sangmin, Maniatis, Petros, Naik, Mayur
We present Mantis, a new framework that automatically predicts program performance with high accuracy. Mantis integrates techniques from programming languages and machine learning for performance modeling, and is a radical departure from traditional approaches. Mantis extracts program features, which are information about program execution runs, through program instrumentation. It uses machine learning techniques to select the features relevant to performance and creates prediction models as a function of the selected features. Through program analysis, it then generates compact code slices that compute these feature values for prediction. Our evaluation shows that Mantis can achieve more than 93% accuracy using less than 10% of the data for training, a significant improvement over models that are oblivious to program features. The generated slices make feature values cheap to compute at prediction time.
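The pipeline's stages might be wired together as in the following sketch; the function boundaries are illustrative assumptions, and compute_slice_value stands in for executing a compact code slice, since real slice generation requires program-analysis machinery beyond this sketch.

# Sketch of the Mantis pipeline: fit a sparse model over profiled features,
# keep only the selected ones, and at prediction time evaluate just the
# (placeholder) slices that compute those features.
import numpy as np
from sklearn.linear_model import LassoCV

def fit_performance_model(feature_matrix: np.ndarray, runtimes: np.ndarray):
    """feature_matrix: one row per profiled run, one column per feature."""
    model = LassoCV(cv=5).fit(feature_matrix, runtimes)
    selected = np.flatnonzero(np.abs(model.coef_) > 1e-6)  # relevant features
    return model, selected

def predict_runtime(model, selected, compute_slice_value):
    # compute_slice_value(i) stands in for running the compact code slice
    # that evaluates feature i on a new input, instead of a full program run.
    features = np.zeros(model.coef_.shape[0])
    for i in selected:
        features[i] = compute_slice_value(i)
    return model.predict(features.reshape(1, -1))[0]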