
 Devanbu, Prem


Quality and Trust in LLM-generated Code

arXiv.org Artificial Intelligence

Machine learning models are widely used but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that a rational decision can be made about whether or not to use the output. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, then the model is said to be well-calibrated. In this case, for example, high-confidence outputs could be safely accepted, and low-confidence outputs rejected. Calibration has so far been studied in non-generative (e.g., classification) settings, especially in Software Engineering. However, generated code can quite often be wrong: developers need to know when they should, e.g., directly use, use after careful review, or discard model-generated code; thus Calibration is vital in generative settings. However, the notion of correctness of generated code is non-trivial, and thus so is Calibration. In this paper we make several contributions. We develop a framework for evaluating the Calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, generative code models are not well-calibrated out of the box. We then show how Calibration can be improved, using standard methods such as Platt scaling. Our contributions will lead to better-calibrated decision-making in the current use of code generated by language models, and offer a framework for future research to further improve calibration methods for generative models in Software Engineering.
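
For readers unfamiliar with the calibration machinery the abstract names, the sketch below shows, with invented variable names and hypothetical data, how Platt scaling and expected calibration error are commonly computed: a one-feature logistic regressor is fit on held-out (confidence, correctness) pairs and then used to rescale new confidences. This is a generic illustration of the standard techniques mentioned, not the paper's exact experimental setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

def expected_calibration_error(confidence, correct, n_bins=10):
    """Bin outputs by confidence; ECE is the bin-weighted gap between accuracy and confidence."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidence[in_bin].mean())
    return ece

def platt_scale(conf_fit, correct_fit, conf_new):
    """Fit a one-feature logistic regressor on held-out pairs, then rescale new confidences."""
    lr = LogisticRegression()
    lr.fit(np.asarray(conf_fit).reshape(-1, 1), np.asarray(correct_fit))
    return lr.predict_proba(np.asarray(conf_new).reshape(-1, 1))[:, 1]

# Hypothetical usage: conf_dev/correct_dev come from a held-out set with known correctness labels.
# rescaled = platt_scale(conf_dev, correct_dev, conf_test)
# print(expected_calibration_error(rescaled, correct_test))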


Towards Understanding What Code Language Models Learned

arXiv.org Artificial Intelligence

Pre-trained language models are effective in a variety of natural language tasks, but it has been argued that their capabilities fall short of fully learning meaning or understanding language. To understand the extent to which language models can learn some form of meaning, we investigate their ability to capture the semantics of code beyond superficial frequency and co-occurrence. In contrast to previous research on probing models for linguistic features, we study pre-trained models in a setting that allows for objective and straightforward evaluation of a model's ability to learn semantics. In this paper, we examine whether such models capture the semantics of code, which is precisely and formally defined. Through experiments involving the manipulation of code fragments, we show that pre-trained models of code learn a robust representation of the computational semantics of code that goes beyond superficial features of form alone.
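
As a concrete, hypothetical illustration of what "manipulation of code fragments" can mean when semantics is defined by execution behavior: a consistent identifier renaming changes the surface form but not the computed result, while a one-token operator swap keeps the form similar but changes the result. The snippets below are illustrative only and are not drawn from the paper's benchmark.

# Three variants of one fragment; "semantics" here means the result of executing it.
original = "def f(x, y):\n    return x + y * 2"
renamed  = "def f(a, b):\n    return a + b * 2"   # semantics-preserving: consistent renaming
altered  = "def f(x, y):\n    return x - y * 2"   # semantics-altering: one-token operator swap

def run(source, *args):
    namespace = {}
    exec(source, namespace)        # compile and execute the fragment in an isolated namespace
    return namespace["f"](*args)

assert run(original, 3, 4) == run(renamed, 3, 4)   # same meaning despite different surface form
assert run(original, 3, 4) != run(altered, 3, 4)   # different meaning despite similar surface form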


Deep Learning & Software Engineering: State of Research and Future Directions

arXiv.org Artificial Intelligence

The advent of deep learning (DL) has fundamentally changed the landscape of modern software. Generally, a DL system comprises several interconnected computational units that form "layers", which perform mathematical transformations, according to sets of learnable parameters, on data passing through them. These architectures can be "trained" for specific tasks by updating the parameters according to a model's performance on a labeled set of training data. DL represents a fundamental shift in the manner by which machines learn patterns from data, automatically extracting salient features for a given computational task rather than relying upon human intuition. These DL systems can be viewed as an inflection point for software development, as they enable new capabilities that cannot be realized cost-effectively through "traditional" software, wherein the behavior of a program must be specified analytically.
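
To make the description above concrete, here is a minimal, self-contained sketch in plain NumPy, using toy data invented for illustration, of a two-layer network: each layer applies a mathematical transformation governed by learnable parameters, and training updates those parameters according to performance on labeled data. It stands in for no particular system discussed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: the target is 1 when the two features sum to more than 1.
X = rng.random((200, 2))
y = (X.sum(axis=1) > 1.0).astype(float).reshape(-1, 1)

# Two "layers": each a linear transformation with learnable parameters, followed by a nonlinearity.
W1, b1 = rng.normal(size=(2, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate = 0.5
for _ in range(2000):
    # Forward pass: data flows through the layers' transformations.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the cross-entropy loss on the labels drive the updates.
    grad_logits = (p - y) / len(X)
    grad_W2, grad_b2 = h.T @ grad_logits, grad_logits.sum(axis=0)
    grad_h = (grad_logits @ W2.T) * (1.0 - h ** 2)
    grad_W1, grad_b1 = X.T @ grad_h, grad_h.sum(axis=0)

    # Update the learnable parameters according to performance on the training data.
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1

print("training accuracy:", ((p > 0.5) == (y > 0.5)).mean())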