Goto

Collaborating Authors

 rpg





RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Luo, Jane, Zhang, Xin, Liu, Steven, Wu, Jie, Liu, Jianfeng, Huang, Yiming, Huang, Yangyu, Yin, Chengyu, Xin, Ying, Zhan, Yuefeng, Sun, Hao, Chen, Qi, Li, Scarlett, Yang, Mao

arXiv.org Artificial Intelligence

Large language models excel at generating individual functions or single files of code, yet generating complete repositories from scratch remains a fundamental challenge. This capability is key to building coherent software systems from high-level specifications and realizing the full potential of automated code generation. The process requires planning at two levels: deciding what features and modules to build (proposal stage) and defining their implementation details (implementation stage). Current approaches rely on natural language planning, which often produces unclear specifications, misaligned components, and brittle designs due to its inherent ambiguity and lack of structure. To address these limitations, we introduce the Repository Planning Graph (RPG), a structured representation that encodes capabilities, file structures, data flows, and functions in a unified graph. By replacing free-form natural language with an explicit blueprint, RPG enables consistent long-horizon planning for repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework that operates in three stages: proposal-level planning, implementation-level construction, and graph-guided code generation with test validation. To evaluate, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces nearly 36K Code Lines and 445K Code Tokens, on average 3.9$\times$ larger than the strongest baseline (Claude Code), and 68$\times$ larger than other baselines. It achieves 81.5% coverage and 69.7% test accuracy, improving over Claude Code by 27.3 and 35.8 points. Further analysis shows that RPG models complex dependencies, enables more sophisticated planning through near-linear scaling, and improves agent understanding of repositories, thus accelerating localization.





Reparameterization Proximal Policy Optimization

Zhong, Hai, Wang, Xun, Li, Zhuoran, Huang, Longbo

arXiv.org Artificial Intelligence

Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. However, a critical barrier is its training instability, where high-variance gradients can destabilize the learning process. To address this, we draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse in the model-free setting. We first establish a connection between this surrogate objective and RPG, which has been largely unexplored and is non-trivial. Then, we bridge this gap by demonstrating that the reparameterization gradient of a PPO-like surrogate objective can be computed efficiently using backpropagation through time. Based on this key insight, we propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG. It is further stabilized by Kullback-Leibler (KL) divergence regularization and remains fully compatible with existing variance reduction methods. We evaluate RPO on a suite of challenging locomotion and manipulation tasks, where experiments demonstrate that our method achieves superior sample efficiency and strong performance.


From General Relation Patterns to Task-Specific Decision-Making in Continual Multi-Agent Coordination

Yao, Chang, Lin, Youfang, Song, Shoucheng, Wu, Hao, Ma, Yuqing, Han, Shang, Lv, Kai

arXiv.org Artificial Intelligence

Continual Multi-Agent Reinforcement Learning (Co-MARL) requires agents to address catastrophic forgetting issues while learning new coordination policies with the dynamics team. In this paper, we delve into the core of Co-MARL, namely Relation Patterns, which refer to agents' general understanding of interactions. In addition to generality, relation patterns exhibit task-specificity when mapped to different action spaces. To this end, we propose a novel method called General Relation Patterns-Guided Task-Specific Decision-Maker (RPG). In RPG, agents extract relation patterns from dynamic observation spaces using a relation capturer. These task-agnostic relation patterns are then mapped to different action spaces via a task-specific decision-maker generated by a conditional hypernetwork. To combat forgetting, we further introduce regularization items on both the relation capturer and the conditional hypernetwork. Results on SMAC and LBF demonstrate that RPG effectively prevents catastrophic forgetting when learning new tasks and achieves zero-shot generalization to unseen tasks.


Reusing Trajectories in Policy Gradients Enables Fast Convergence

Montenegro, Alessandro, Mansutti, Federico, Mussi, Marco, Papini, Matteo, Metelli, Alberto Maria

arXiv.org Artificial Intelligence

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, such reliance on fresh data makes them sample-inefficient. Indeed, vanilla PG methods require $O(ε^{-2})$ trajectories to reach an $ε$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(ε^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(ε^{-1})$, the best known rate in the literature. We further validate empirically our approach against PG methods with state-of-the-art rates.