 Automatic Programming


Code Generation as a Dual Task of Code Summarization

Neural Information Processing Systems

Code summarization (CS) and code generation (CG) are two crucial tasks in automatic software development. Various neural network-based approaches have been proposed to solve the two tasks separately. However, there is an intuitive correlation between CS and CG that has not been exploited in previous work. In this paper, we apply the relation between the two tasks to improve the performance of both: exploiting their duality, we propose a dual training framework that trains the two tasks simultaneously. In this framework, we consider the dualities on probability and on attention weights, and design corresponding regularization terms to constrain them. We evaluate our approach on two datasets collected from GitHub, and experimental results show that our dual framework improves the performance of both CS and CG over the baselines.
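
A minimal sketch of how a probability-duality regularizer of the kind described above could be combined with the two task losses, assuming PyTorch tensors for the log-probabilities; the function names, the joint objective, and the weight lam are illustrative, not the paper's exact formulation:

```python
import torch

def duality_regularizer(log_p_code, log_p_comment,
                        log_p_comment_given_code, log_p_code_given_comment):
    # Both factorizations equal the log joint probability of (code, comment),
    # so their difference should be zero; penalize the squared violation.
    gap = (log_p_code + log_p_comment_given_code) \
        - (log_p_comment + log_p_code_given_comment)
    return gap.pow(2).mean()

def dual_training_loss(cs_loss, cg_loss, dual_reg, lam=0.1):
    # Joint objective: code-summarization and code-generation task losses
    # plus the weighted duality constraint (lam is a hypothetical weight).
    return cs_loss + cg_loss + lam * dual_reg
```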


CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features plan verification and internal debugging through step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at https://kagnlp.github.io/codesim.github.io/.
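
A rough sketch of the kind of simulation-driven loop the abstract describes, assuming a hypothetical `llm` callable that returns text; the prompts, keyword checks, and agent split are illustrative and not the authors' implementation:

```python
def simulation_driven_generation(problem, examples, llm, max_debug_rounds=3):
    # Planning step: draft a step-by-step plan for the problem.
    plan = llm(f"Write a step-by-step plan to solve:\n{problem}")

    # Plan verification: ask the model to simulate the plan on the sample
    # input/output pairs and check that the expected outputs are reproduced.
    verdict = llm(f"Simulate this plan step by step on the examples {examples}. "
                  f"Answer 'yes' or 'no': does it produce the expected outputs?\n{plan}")
    if verdict.strip().lower().startswith("no"):
        plan = llm(f"Revise the plan so the simulation matches the examples:\n{plan}")

    # Coding step: implement the (verified) plan.
    code = llm(f"Implement this plan in Python for the problem below.\n{plan}\n{problem}")

    # Internal debugging: step-by-step simulation of the code on the examples,
    # feeding any reported mismatch back for repair (crude keyword check).
    for _ in range(max_debug_rounds):
        trace = llm(f"Step through this code on {examples}. "
                    f"Report 'mismatch' plus details if any output is wrong:\n{code}")
        if "mismatch" not in trace.lower():
            break
        code = llm(f"Fix the code given this simulated trace:\n{trace}\n{code}")
    return code
```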


Proving the Coding Interview: A Benchmark for Formally Verified Code Generation

arXiv.org Artificial Intelligence

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness; it is the largest formal verification benchmark to date and includes 1083 curated, quality-controlled samples. Previously, APPS provided a benchmark and dataset of programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean's "sorry" keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve each general-purpose programming problem together with its associated correctness specifications. The benchmark is available at https://huggingface.co/datasets/quinn-dougherty/fvapps.
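
A hypothetical FVAPPS-style sample (not taken from the dataset) illustrating how a unit-test check generalizes to a Lean 4 theorem stated without proof via the `sorry` keyword; the function and theorem names are invented for illustration:

```lean
-- Hypothetical example (not from the dataset): a candidate program plus a
-- correctness specification left unproven with Lean's `sorry` keyword.
def maxOfList : List Nat → Nat
  | []      => 0
  | x :: xs => max x (maxOfList xs)

-- The unit test `maxOfList [1, 7, 3] == 7` generalizes to a theorem:
-- every element of the list is bounded by the computed maximum.
theorem maxOfList_ge_mem (l : List Nat) (x : Nat) (h : x ∈ l) :
    x ≤ maxOfList l := by
  sorry
```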


CodeSCM: Causal Analysis for Multi-Modal Code Generation

arXiv.org Artificial Intelligence

In this paper, we propose CodeSCM, a Structural Causal Model (SCM) for analyzing multi-modal code generation with large language models (LLMs). By applying interventions to CodeSCM, we measure the causal effects of different prompt modalities, such as natural language, code, and input-output examples, on the model. CodeSCM introduces latent mediator variables to separate the code and natural language semantics of a multi-modal code generation prompt. Applying the principles of Causal Mediation Analysis to these mediators, we quantify direct effects that represent the model's spurious leanings. We find that, in addition to natural language instructions, input-output examples significantly influence code generation.
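
A toy illustration of the kind of intervention the abstract describes, assuming hypothetical `generate` and `passes_tests` helpers and prompts stored as dicts with `nl` and `io_examples` fields; it is not CodeSCM itself, just a pass-rate comparison under removal of one modality:

```python
def io_examples_effect(prompts, generate, passes_tests):
    def pass_rate(prompt_texts):
        return sum(passes_tests(generate(p)) for p in prompt_texts) / len(prompt_texts)

    # Original prompts: natural-language instruction plus input-output examples.
    full = pass_rate([p["nl"] + "\n" + p["io_examples"] for p in prompts])
    # Intervention: delete the input-output example modality from each prompt.
    ablated = pass_rate([p["nl"] for p in prompts])
    # A crude estimate of how much the I/O-example modality influences generation.
    return full - ablated
```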


Large Language Model Guided Self-Debugging Code Generation

arXiv.org Artificial Intelligence

Automated code generation is gaining significant importance in intelligent computer programming and system deployment. However, current approaches often face challenges in computational efficiency and lack robust mechanisms for code parsing and error correction. In this work, we propose a novel framework, PyCapsule, with a simple yet effective two-agent pipeline and efficient self-debugging modules for Python code generation. PyCapsule features sophisticated prompt inference, iterative error handling, and case testing, ensuring high generation stability, safety, and correctness. Empirically, PyCapsule achieves up to a 5.7% improvement in success rate on HumanEval, 10.3% on HumanEval-ET, and 24.4% on BigCodeBench compared to state-of-the-art methods. We also observe a decrease in normalized success rate with more self-debugging attempts, potentially caused by the limited and noisy error feedback that is retained. PyCapsule demonstrates broader impacts on advancing lightweight and efficient code generation for artificial intelligence systems.
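
A minimal self-debugging loop in the spirit of what the abstract describes, assuming a hypothetical `llm` callable and a caller-supplied `tests` function that raises on failure; this is a sketch, not PyCapsule's pipeline:

```python
import traceback

def generate_with_self_debugging(task, tests, llm, max_attempts=4):
    # Generator step: produce an initial candidate solution.
    code = llm(f"Write a Python function that solves:\n{task}")
    for _ in range(max_attempts):
        try:
            namespace = {}
            exec(code, namespace)   # run the candidate definition
            tests(namespace)        # caller-supplied assertions; raises on failure
            return code             # all tests passed
        except Exception:
            error = traceback.format_exc()
            # Error-handling step: feed the (possibly noisy) traceback back
            # to the model for iterative repair.
            code = llm(f"The code below failed with:\n{error}\nFix it:\n{code}")
    return code
```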


Process-Supervised Reinforcement Learning for Code Generation

arXiv.org Artificial Intelligence

Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise for multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle is the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a "statement mutation/refactoring-compile and execution verification" strategy: code is mutated and refactored line by line by a teacher model, and compiler execution results are used to automatically label each line. This yields line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework, followed by experimental validation on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, on complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generated results.
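
A simplified sketch of how line-level process labels could be produced from compile-and-execution feedback, assuming hypothetical `teacher_rewrite` and `compiles_and_passes` helpers; this is one interpretation of the strategy described above, not the authors' pipeline:

```python
def build_process_labels(solution_lines, teacher_rewrite, compiles_and_passes):
    # One interpretation (not the authors' code): a teacher model rewrites each
    # line; the rewritten line is labeled by whether the modified program still
    # compiles and passes its tests, yielding line-level supervision data.
    labeled = []
    for i, line in enumerate(solution_lines):
        variant = teacher_rewrite(line)
        candidate = solution_lines[:i] + [variant] + solution_lines[i + 1:]
        label = 1 if compiles_and_passes("\n".join(candidate)) else 0
        labeled.append((i, variant, label))
    return labeled  # (line index, rewritten line, label) triples
```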


Reviews: Code Generation as a Dual Task of Code Summarization

Neural Information Processing Systems

This paper presents an interesting approach of using the duality relationship between Code Summarization (CS) and Code Generation (CG) to improve the performance of a neural model on both tasks simultaneously. The main idea is to exploit the fact that the conditional probability of a comment given some source code, and the conditional probability of source code given a comment, are both related through their common joint probability. Moreover, since both the CS and CG tasks use an attention-based seq2seq architecture, the paper also proposes to add a constraint that the two attention vectors have similar distributions, i.e., the attention weight of comment word i to source token j for the CS task is similar to the attention weight of the same pair for the CG task. The method is evaluated on two datasets of Java and Python program/comment pairs, and the dual training outperforms several baseline methods, including the same architecture trained without the dual constraints (the basic model). Overall, I liked the idea of exploiting the dual relationship between the code summarization and code generation tasks. The proposed dual regularization terms, relating to the factorization of the conditional probability distributions and the similarity of the attention matrices, are quite elegant.
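
A compact restatement of the two constraints the review describes, writing x for the source code and y for the comment; the regularizer form shown is a plausible sketch rather than the paper's exact loss:

```latex
% Both factorizations equal the log joint probability:
\log P(x) + \log P(y \mid x) \;=\; \log P(x, y) \;=\; \log P(y) + \log P(x \mid y)

% so a duality regularizer can penalize the squared violation
\ell_{\text{dual}} = \bigl(\log P(x) + \log P(y \mid x) - \log P(y) - \log P(x \mid y)\bigr)^{2},

% and the attention constraint encourages A^{CS}_{ij} \approx A^{CG}_{ji}
% for each (comment word i, source token j) pair.
```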


Reviews: Code Generation as a Dual Task of Code Summarization

Neural Information Processing Systems

All reviewers liked the dual relationship between the code summarization and code generation tasks. They were also satisfied with the implementation and experiments. These problems are both difficult and important, hence progress is of interest even if such dualities have been identified in other contexts.


A Additional Details and Results for Code Generation Experiments

Neural Information Processing Systems

Figure 7: Actual example of how an anchor function impacts the generated solution. We construct the anchor function by taking the function signature from the HumanEval prompt (blue), removing the docstring and variable typing, appending n lines of the canonical solution (green), then adding anchoring lines (red). We prompt Codex with the anchor function, the HumanEval prompt, and the first n lines of the canonical solution (above the black line). The full canonical solution is on the right (green text, grey box). We see that the solution Codex generates (below the black line) combines elements of the canonical solution (e.g.
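
An illustrative helper (not the authors' code) for the anchor-prompt construction the caption describes; `stripped_signature`, `canonical_lines`, and `anchoring_lines` are assumed inputs:

```python
def build_anchor_prompt(stripped_signature, humaneval_prompt,
                        canonical_lines, anchoring_lines, n):
    # `stripped_signature` is the HumanEval function signature with its
    # docstring and type annotations removed.
    anchor_function = "\n".join([stripped_signature]
                                + canonical_lines[:n]
                                + anchoring_lines)
    # The model is prompted with the anchor function, the original HumanEval
    # prompt, and the first n lines of the canonical solution as a prefix.
    return "\n".join([anchor_function, humaneval_prompt] + canonical_lines[:n])
```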


Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation

arXiv.org Artificial Intelligence

The use of Large Language Models (LLMs) for code generation has gained significant attention in recent years. Existing methods often aim to improve the quality of generated code by incorporating additional contextual information or guidance into input prompts. Many of these approaches adopt sequential reasoning strategies, mimicking human-like step-by-step thinking. However, such strategies may constrain flexibility, as they do not always align with the structured characteristics of programming languages. This paper introduces the Chain of Grounded Objectives (CGO), a method that embeds functional objectives into input prompts to enhance code generation. By leveraging appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluations demonstrate that CGO effectively enhances code generation, addressing limitations of existing approaches.
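
An illustrative prompt template in the spirit of what the abstract describes, with invented wording; the paper's actual format for embedding functional objectives may differ:

```python
def cgo_style_prompt(problem_description, objectives):
    # Functional objectives are embedded directly in the prompt instead of
    # explicit step-by-step reasoning instructions (illustrative only).
    objective_block = "\n".join(f"- {obj}" for obj in objectives)
    return (
        f"{problem_description}\n\n"
        f"The implementation must satisfy these functional objectives:\n"
        f"{objective_block}\n\n"
        f"Write the Python function."
    )
```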