 code synthesis


Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Neural Information Processing Systems

Program synthesis has long been studied, with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. This limitation in the existing benchmarks raises the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus, a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code.
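The concern about limited test-cases can be illustrated with a small sketch. It is not the EvalPlus implementation; the functions `reference_median`, `candidate_median`, `passes`, and `augmented_inputs`, as well as the choice of task, are hypothetical names introduced here for illustration. A buggy candidate passes a small hand-written test set, then fails once the inputs are augmented and its outputs are compared against a trusted reference solution.

```python
from itertools import product


def reference_median(xs):
    """Trusted reference solution (assumed correct for this sketch)."""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2


def candidate_median(xs):
    """Hypothetical generated code: wrong for most even-length inputs."""
    s = sorted(xs)
    return s[len(s) // 2]


# A small, hand-written test set of the kind a benchmark might ship with.
# The even-length case [4, 4] happens not to expose the bug.
BASE_TESTS = [[1], [3, 1, 2], [4, 4]]


def augmented_inputs(max_len=3, values=range(3)):
    """Enumerate every list of length 1..max_len over a small value range."""
    return [list(t) for n in range(1, max_len + 1) for t in product(values, repeat=n)]


def passes(candidate, inputs):
    """Differential check: candidate must agree with the reference on every input."""
    return all(candidate(list(x)) == reference_median(list(x)) for x in inputs)


print(passes(candidate_median, BASE_TESTS))                       # True
print(passes(candidate_median, BASE_TESTS + augmented_inputs()))  # False
```

The exhaustive enumeration is only a stand-in for whatever input generator an evaluation framework actually uses; the point is that a verdict of "correct" depends on how many and which inputs are checked.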


An AST-guided LLM Approach for SVRF Code Synthesis

Abdelmalak, Abanoub E., Elsayed, Mohamed A., Abercrombie, David, Torunoglu, Ilhami

arXiv.org Artificial Intelligence

Standard Verification Rule Format (SVRF) is essential for semiconductor applications like Design Rule Check (DRC), Layout Versus Schematic (LVS), and Optical Proximity Correction (OPC). It faces challenges as advancing nodes create complex design rules that render traditional SVRF development ineffective and highlight an expertise gap. This paper introduces a novel methodology integrating Abstract Syntax Tree (AST) embedding and Retrieval-Augmented Generation (RAG) for enhanced SVRF code synthesis, ensuring semantic accuracy and error minimization through structural validation with domain-specific insights for precise code generation. We evaluate different T5-based models and propose an innovative SVRF-specific scoring framework that complements standard metrics like BLEU and ROUGE-L. In our approach, AST provides rigorous structural validation, while RAG infuses relevant domain knowledge, effectively enhancing the code generation workflow. Testing on a comprehensive benchmark of 740 DRC rule implementations, our methodology demonstrates up to a 40% improvement in code generation accuracy compared to a basic text-based fine-tuning process. This fusion of industry expertise with advanced coding strategies not only optimizes SVRF development under limited dataset constraints but also creates a more intuitive and efficient coding environment. Consequently, users can rapidly iterate through design cycles, reduce manual error correction, and significantly improve overall productivity.


Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Piterbarg, Ulyana, Pinto, Lerrel, Fergus, Rob

arXiv.org Artificial Intelligence

Software engineers mainly write code by editing existing programs. In contrast, large language models (LLMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of open-sourced edit data. While high-quality instruction data for code synthesis is already scarce, high-quality edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into a sequence of code edits by using a linter to procedurally sample across the error-free insertions that can be used to sequentially write programs. It outputs edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction finetune a series of smaller LLMs ranging from 2.6B to 14B parameters on both the re-factored and original versions of this dataset, comparing zero-shot performance on code synthesis benchmarks. We show that during repeated sampling, edit sequence finetuned models produce more diverse programs than baselines. This results in better inference-time scaling for benchmark coverage as a function of samples, i.e. the fraction of problems "pass@k" solved by any attempt given "k" tries. For example, on HumanEval pass@50, small LLMs finetuned on synthetic edit sequences are competitive with GPT-4 and outperform models finetuned on the baseline dataset by +20% (+/-3%) in absolute score. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning tiny models on synthetic code edits results in state-of-the-art code synthesis for the on-device model class. Our 150M parameter edit sequence LM matches or outperforms code models with twice as many parameters, both with and without repeated sampling, including Codex and AlphaCode.
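The "pass@k" metric mentioned above can be sketched as follows. The abstract does not say how the authors compute it; this sketch assumes the commonly used unbiased estimator introduced with Codex, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The function names `pass_at_k` and `benchmark_coverage` are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c of n drawn samples passed (assumed estimator, see lead-in)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_coverage(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two problems with 10 samples each: one never solved, one always solved.
print(benchmark_coverage([(10, 0), (10, 10)], k=1))  # 0.5
```

Under this estimator, a model that produces more diverse samples tends to gain more from larger k, because additional samples are more likely to cover problems that earlier samples missed.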


CodeMind: A Framework to Challenge Large Language Models for Code Reasoning

Liu, Changshu, Zhang, Shizhuo Dylan, Ibrahimzada, Ali Reza, Jabbarvand, Reyhaneh

arXiv.org Artificial Intelligence

Relying solely on test passing to evaluate Large Language Models (LLMs) for code synthesis may result in unfair assessment or in promoting models with data leakage. As an alternative, we introduce CodeMind, a framework designed to gauge the code reasoning abilities of LLMs. CodeMind currently supports three code reasoning tasks: Independent Execution Reasoning (IER), Dependent Execution Reasoning (DER), and Specification Reasoning (SR). The first two evaluate a model's ability to predict the execution output of arbitrary code or of code the model could correctly synthesize. The third one evaluates the extent to which LLMs implement the specified expected behavior. Our extensive evaluation of nine LLMs across five benchmarks in two different programming languages using CodeMind shows that LLMs follow control flow constructs fairly well and, in general, explain how inputs evolve to output, specifically for simple programs and the ones they can correctly synthesize. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. Furthermore, we observe that, while correlated, specification reasoning (essential for code synthesis) does not imply execution reasoning (essential for broader programming tasks such as testing and debugging): ranking LLMs based on test passing can be different compared to code reasoning.


Enhancing Security of AI-Based Code Synthesis with GitHub Copilot via Cheap and Efficient Prompt-Engineering

Res, Jakub, Homoliak, Ivan, Perešíni, Martin, Smrčka, Aleš, Malinka, Kamil, Hanacek, Petr

arXiv.org Artificial Intelligence

AI assistants for coding are on the rise. However, one of the reasons developers and companies avoid harnessing their full potential is the questionable security of the generated code. This paper first reviews the current state-of-the-art and identifies areas for improvement on this issue. Then, we propose a systematic approach based on prompt-altering methods to achieve better code security of (even proprietary black-box) AI-based code generators such as GitHub Copilot, while minimizing the complexity of the application from the user point-of-view, the computational resources, and operational costs. In sum, we propose and evaluate three prompt-altering methods: (1) scenario-specific, (2) iterative, and (3) general clause, and we discuss their combination. Contrary to the audit of code security, the latter two of the proposed methods require no expert knowledge from the user. We assess the effectiveness of the proposed methods on GitHub Copilot using the OpenVPN project in realistic scenarios, and we demonstrate that the proposed methods reduce the number of insecure generated code samples by up to 16% and increase the number of secure code samples by up to 8%. Since our approach does not require access to the internals of the AI models, it can in general be applied to any AI-based code synthesizer, not only GitHub Copilot.


Automatic Unit Test Data Generation and Actor-Critic Reinforcement Learning for Code Synthesis

Gorinski, Philip John, Zimmer, Matthieu, Lampouras, Gerasimos, Deik, Derrick Goh Xin, Iacobacci, Ignacio

arXiv.org Artificial Intelligence

The advent of large pre-trained language models in the domain of Code Synthesis has shown remarkable performance on various benchmarks, treating the problem of Code Generation in a fashion similar to Natural Language Generation, trained with a Language Modelling (LM) objective. In addition, the property of programming language code being precisely evaluable with respect to its semantics -- through the use of Unit Tests to check its functional correctness -- lends itself to using Reinforcement Learning (RL) as a further training paradigm. Previous work has shown that RL can be applied as such to improve models' coding capabilities; however, such RL-based methods rely on a reward signal based on defined Unit Tests, which are much harder to obtain compared to the huge crawled code datasets used in LM objectives. In this work, we present a novel approach to automatically obtain data consisting of function signatures and associated Unit Tests, suitable for RL training of Code Synthesis models. We also introduce a simple yet effective Actor-Critic RL training scheme and show that, in conjunction with automatically generated training data, it improves a pre-trained code language model's performance by up to 9.9% over the original underlying code synthesis LM, and by up to 4.3% over RL-based models trained with standard PPO or CodeRL.


Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation

Karmakar, Anjan, Prenner, Julian Aron, D'Ambros, Marco, Robbes, Romain

arXiv.org Artificial Intelligence

The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.


A Hazard Analysis Framework for Code Synthesis Large Language Models

Khlaaf, Heidy, Mishkin, Pamela, Achiam, Joshua, Krueger, Gretchen, Brundage, Miles

arXiv.org Artificial Intelligence

Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code. Although Codex provides a plethora of benefits, models that may generate code at such scale have significant limitations, alignment problems, the potential to be misused, and the possibility to increase the rate of progress in technical fields that may themselves have destabilizing impacts or have misuse potential. Yet such safety impacts are not yet known or remain to be explored. In this paper, we outline a hazard analysis framework constructed at OpenAI to uncover hazards or safety risks that the deployment of models like Codex may impose technically, socially, politically, and economically. The analysis is informed by a novel evaluation framework that determines the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts, and their capability to understand and execute them relative to human ability.


Amanuensis: The Programmer's Apprentice

Dean, Thomas, Chiang, Maurice, Gomez, Marcus, Gruver, Nate, Hindy, Yousef, Lam, Michelle, Lu, Peter, Sanchez, Sophia, Saxena, Rohun, Smith, Michael, Wang, Lucy, Wong, Catherine

arXiv.org Artificial Intelligence

Suppose you could merely imagine a computation, and a digital prosthesis, an extension of your biological brain, would turn it into code that instantly realizes what you had in mind. Imagine looking at an image, dataset or set of equations and wanting to analyze and explore its meaning as an artistic whim or part of a scientific investigation. I don't mean you would use an existing software suite to produce a standard visualization, but rather you would make use of an extensive repository of existing code to assemble a new program analogous to how a composer draws upon a repertoire of musical motifs, themes and styles to construct new works, and tantamount to having a talented musical amanuensis who, in addition to copying your scores, takes liberties with your prior work, making small alterations here and there and occasionally adding new works of its own invention, novel but consistent with your taste and sensibilities. Perhaps the interaction would be wordless and you would express your objective by simply focusing your attention and guiding your imagination, the prosthesis operating directly on patterns of activation arising in your primary sensory, proprioceptive and associative cortex that have become part of an extensive vocabulary that you now share with your personal digital amanuensis. Or perhaps it would involve a conversation conducted in subvocal, unarticulated speech in which you specify what it is you want to compute and your assistant asks questions to clarify your intention and the two of you share examples of input and output to ground your internal conversation in concrete terms. More than thirty years ago, Charles Rich and Richard Waters published an MIT AI Lab technical report [68] entitled The Programmer's Apprentice: A Research Overview.
Whether they intended it or not, it would have been easy in those days for someone to misremember the title and inadvertently refer to it as "The Sorcerer's Apprentice" since computer programmers at the time were often characterized as wizards and most children were familiar with the Walt Disney movie Fantasia, featuring music written by Paul Dukas inspired by Goethe's poem of the same name