Training Language Models to Generate Quality Code with Program Analysis Feedback
Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
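The two automated signals described above can be illustrated with a minimal sketch. This is not REAL's implementation: the toy analyzer (which only flags missing type annotations), the binary quality score, the `quality_weight` parameter, and all function names are assumptions chosen for illustration.

```python
from __future__ import annotations

import ast


def count_unannotated_functions(source: str) -> int:
    """Toy program analysis: count functions missing an argument or return annotation."""
    tree = ast.parse(source)
    missing = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            if node.returns is None or any(a.annotation is None for a in args):
                missing += 1
    return missing


def unit_test_pass_rate(source: str, tests: list[tuple[str, tuple, object]]) -> float:
    """Run (function_name, args, expected) checks and return the fraction that pass.

    A real training setup would execute generated code in a sandbox, not with exec().
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
    except Exception:
        return 0.0
    passed = 0
    for func_name, args, expected in tests:
        try:
            if namespace[func_name](*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests) if tests else 0.0


def hybrid_reward(source: str, tests: list[tuple[str, tuple, object]],
                  quality_weight: float = 0.5) -> float:
    """Blend functional correctness with a defect-free bonus (hypothetical weighting)."""
    try:
        defects = count_unannotated_functions(source)
    except SyntaxError:
        return 0.0  # code that does not parse earns no reward
    quality = 1.0 if defects == 0 else 0.0
    functionality = unit_test_pass_rate(source, tests)
    return (1.0 - quality_weight) * functionality + quality_weight * quality
```

Neither signal needs a reference solution or a manual label, which is the property the abstract calls reference-free. A production setup would replace the toy analyzer with a real static analysis tool.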
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks--Bolt.diy,
Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks Supplementary Materials
The source code of Minigrid and Miniworld can be found at https://github.com/. To run the experiments, we implemented the following functionality: 1. human trajectory saving for MiniGrid-FourRooms-v0 (we copied the ManualControl class from Minigrid and added 38 lines of code, most of which call data-saving functions); 2. human trajectory saving for MiniWorld-FourRooms-v0 (we copied the ManualControl class from Miniworld and added 45 lines of code, most of which call data-saving functions); 3. data saving and plotting for MiniGrid-FourRooms-v0 (33 lines of code, mostly for Matplotlib); 4. data saving and plotting for MiniWorld-FourRooms-v0 (33 lines of code, mostly for Matplotlib). In total, implementing this new functionality required 149 lines of code. The source code is hosted on GitHub. We bear all responsibility in case of violation of rights.
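The trajectory-saving additions listed above are not reproduced in this text. As a rough sketch of what such data-saving calls might look like, the following recorder stores one entry per step and writes the result as JSON. The class name `TrajectoryRecorder`, its fields, and the JSON layout are all hypothetical. It uses only the Python standard library and does not call the Minigrid or Miniworld APIs.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrajectoryRecorder:
    """Hypothetical helper that accumulates the steps of one human-controlled episode."""

    env_id: str
    steps: list = field(default_factory=list)

    def record(self, action: int, reward: float, terminated: bool) -> None:
        # Called once per environment step, e.g. from a manual-control key handler.
        self.steps.append(
            {"action": action, "reward": reward, "terminated": terminated}
        )

    def save(self, path: str) -> None:
        # Write the environment id and all recorded steps as a single JSON object.
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f)
```

In a manual-control loop, `record` would be called after each environment step and `save` once the episode ends.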
Grounding Representation Similarity with Statistical Testing
To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have sensitivity to changes that affect functional behavior, and specificity against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement.
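For concreteness, here is a minimal sketch of one of the measures named above, linear CKA, in its standard feature-space form: ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) on column-centered activations. This is an illustration in pure Python, not the paper's evaluation code. The function names are assumptions, and inputs are assumed to be lists of rows with one row per example.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean so every feature has zero mean over examples."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA similarity between two representations of the same examples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of examples (rows)")
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    if denominator == 0.0:
        raise ValueError("a representation is constant across examples")
    return numerator / denominator
```

The value lies in [0, 1], and `1 - linear_cka(x, y)` gives a dissimilarity. The score is unchanged by rotating or uniformly rescaling either representation. Invariances of this kind are a design choice that differs between similarity measures.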
Resource-sharing boosts robotic resilience
If the goal of a robot is to perform a function, then minimizing the possibility of failure is a top priority when it comes to robotic design. But this minimization is at odds with the robotic raison d'être: systems with multiple units, or agents, can perform more diverse functions, but they also have more different parts that can potentially fail. Researchers led by Jamie Paik, head of the Reconfigurable Robotics Laboratory (RRL) in EPFL's School of Engineering, have not only circumvented this problem, but flipped it: they have designed a modular robot that actually lowers its odds of failure by sharing resources among its individual agents. "For the first time, we have found a way to reverse the trend of increasing odds of failure with increasing function," Paik explains. "We introduce local resource sharing as a new paradigm in robotics, reducing the failure rate with a larger number of modules."
Robot, make me a chair
Computer-aided design (CAD) systems are tried-and-true tools used to design many of the physical objects we use each day. But CAD software requires extensive expertise to master, and many tools incorporate such a high level of detail that they don't lend themselves to brainstorming or rapid prototyping. In an effort to make design faster and more accessible for non-experts, researchers from MIT and elsewhere developed an AI-driven robotic assembly system that allows people to build physical objects by simply describing them in words. Their system uses a generative AI model to build a 3D representation of an object's geometry based on the user's prompt. Then, a second generative AI model reasons about the desired object and figures out where different components should go, according to the object's function and geometry.