Wang, Pengcheng
Residual Policy Gradient: A Reward View of KL-regularized Objective
Wang, Pengcheng, Zhu, Xinghao, Chen, Yuxin, Xu, Chenfeng, Tomizuka, Masayoshi, Li, Chenran
Reinforcement Learning and Imitation Learning have achieved widespread success in many domains but remain constrained during real-world deployment. One of the main issues is the presence of additional requirements that were not considered during training. To address this challenge, policy customization has been introduced, aiming to adapt a prior policy while preserving its inherent properties and meeting new task-specific requirements. A principled approach to policy customization is Residual Q-Learning (RQL), which formulates the problem as a Markov Decision Process (MDP) and derives a family of value-based learning algorithms. However, RQL has not yet been applied to policy gradient methods, which restricts its applicability, especially in tasks where policy gradient methods have already proven more effective. In this work, we first derive a concise form of Soft Policy Gradient as a preliminary. Building on this, we introduce Residual Policy Gradient (RPG), which extends RQL to policy gradient methods and enables policy customization in gradient-based RL settings. Through the lens of RPG, we rethink the KL-regularized objective widely used in RL fine-tuning. We show that, under certain assumptions, the KL-regularized objective leads to a maximum-entropy policy that balances the inherent properties and the task-specific requirements at the reward level. Our experiments in MuJoCo demonstrate the effectiveness of Soft Policy Gradient and Residual Policy Gradient.
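A rough sketch of the reward-level view mentioned above (the notation is assumed here: $\beta$ is the regularization weight, $\pi_{\mathrm{prior}}$ the prior policy, and the KL term is taken per step against a fixed prior):

$$\mathbb{E}_{\pi}\Big[\sum_t r(s_t,a_t)\Big] - \beta\,\mathrm{KL}\big(\pi\,\|\,\pi_{\mathrm{prior}}\big) = \mathbb{E}_{\pi}\Big[\sum_t \big(r(s_t,a_t) + \beta\log\pi_{\mathrm{prior}}(a_t\mid s_t)\big)\Big] + \beta\,\mathcal{H}(\pi)$$

That is, maximizing the KL-regularized objective is equivalent to maximum-entropy RL on the combined reward $r + \beta\log\pi_{\mathrm{prior}}$, which is the sense in which the prior's inherent behavior and the new task-specific reward are balanced at the reward level.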
TD-M(PC)$^2$: Improving Temporal Difference MPC Through Policy Constraint
Lin, Haotian, Wang, Pengcheng, Schneider, Jeff, Shi, Guanya
The value-learning scheme in the TD-MPC implementation leads to persistent value overestimation. It is also empirically observed that the performance of TD-MPC2 is far from satisfactory on some high-dimensional locomotion tasks [33]. This phenomenon is closely connected to, yet distinct from, the well-known overestimation bias arising from function approximation errors and error accumulation in temporal difference learning [39, 37, 7]. More precisely, we identify the underlying issue as policy mismatch: the behavior policy generated by the MPC planner governs data collection, creating a buffered data distribution that does not directly align with the learned value or policy prior. Through theoretical analysis and experiments, we argue that this issue is deeply rooted in the structural policy mismatch between the data-generation policy, which is always bootstrapped by the planner, and the learned policy prior. To mitigate such a mismatch in a minimalist way, we propose a policy regularization term reducing out-of-distribution (OOD) queries, thereby improving value learning. Our method involves minimal changes on top of existing frameworks and requires no additional computation. Extensive experiments demonstrate that the proposed approach improves performance over baselines such as TD-MPC2 by large margins, particularly in 61-DoF humanoid tasks.
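A minimal sketch of how such a policy constraint could look, assuming a SAC-style update for the learned policy prior and a replay buffer that stores the planner's executed actions; the interfaces (policy.sample, q_net) and the squared-error form of the penalty are illustrative assumptions, not the paper's exact objective:

    import torch

    def policy_prior_loss(policy, q_net, states, planner_actions, alpha=0.2, lam=1.0):
        # Standard SAC-style term: prefer actions with high Q-value and high entropy.
        actions, log_probs = policy.sample(states)          # reparameterized sample
        sac_term = (alpha * log_probs - q_net(states, actions)).mean()
        # Policy-constraint term (assumed form): keep the learned prior close to the
        # planner's buffered actions, so the value function is queried on fewer
        # out-of-distribution actions during temporal difference learning.
        constraint = ((actions - planner_actions) ** 2).sum(dim=-1).mean()
        return sac_term + lam * constraint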
Human-Guided Image Generation for Expanding Small-Scale Training Image Datasets
Chen, Changjian, Lv, Fei, Guan, Yalong, Wang, Pengcheng, Yu, Shengjie, Zhang, Yifan, Tang, Zhuo
The performance of computer vision models in certain real-world applications (e.g., rare wildlife observation) is limited by the small number of available images. Expanding datasets with pre-trained generative models is an effective way to address this limitation. However, since the automatic generation process is uncontrollable, the generated images are usually limited in diversity, and some of them are undesired. In this paper, we propose a human-guided image generation method for more controllable dataset expansion. We develop a multi-modal projection method with theoretical guarantees to facilitate the exploration of both the original and the generated images. Based on this exploration, users refine the prompts and re-generate images for better performance. Since directly refining the prompts is challenging for novice users, we develop a sample-level prompt refinement method to make it easier: users only need to provide sample-level feedback (e.g., which samples are undesired) to obtain better prompts. The effectiveness of our method is demonstrated through a quantitative evaluation of the multi-modal projection method, improved model performance in case studies on both classification and object detection tasks, and positive feedback from experts.
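A minimal sketch of the sample-level feedback loop described above, with the generator, feedback collection, and prompt refinement all passed in as callables; every name here is a placeholder assumption rather than the paper's actual interface:

    def expand_with_feedback(generate, get_feedback, refine_prompt, prompt, rounds=3, n=200):
        # Human-guided expansion: generate images, collect which samples are
        # undesired, refine the prompt at the sample level, and keep the rest.
        accepted = []
        for _ in range(rounds):
            images = generate(prompt, n)                        # pre-trained generative model
            undesired = get_feedback(images)                    # list of booleans from the user
            accepted += [im for im, bad in zip(images, undesired) if not bad]
            prompt = refine_prompt(prompt, images, undesired)   # sample-level refinement
        return accepted, prompt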
Entropy-Regularized Process Reward Model
Zhang, Hanning, Wang, Pengcheng, Diao, Shizhe, Lin, Yong, Pan, Rui, Dong, Hanze, Zhang, Dylan, Molchanov, Pavlo, Zhang, Tong
Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDPs) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on these theoretical results. Our theoretical analysis shows that the optimal reward model can be derived from sampling under the initial policy. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation, and more than a 1% improvement under RLHF. These results highlight the efficacy of entropy regularization in enhancing LLMs' reasoning capabilities.
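A small sketch of how an entropy-regularized process reward could be constructed from continuations sampled under the initial policy, interpolating between mean and max aggregation of their outcome rewards; the symbol eta and the use of outcome rewards as inputs are assumptions for illustration, not quoted from the paper:

    import math

    def er_process_reward(outcome_rewards, eta=1.0):
        # Soft (log-sum-exp) aggregation over N sampled continuations of a partial
        # reasoning trajectory: as eta -> 0 this recovers the max outcome reward,
        # and for large eta it approaches the mean.
        n = len(outcome_rewards)
        return eta * (math.log(sum(math.exp(r / eta) for r in outcome_rewards)) - math.log(n))

    # Example: three rollouts after a given step, one of which reaches a correct answer.
    step_label = er_process_reward([1.0, 0.0, 0.0], eta=0.5)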
Residual-MPPI: Online Policy Customization for Continuous Control
Wang, Pengcheng, Li, Chenran, Weaver, Catherine, Kawamoto, Kenta, Tomizuka, Masayoshi, Tang, Chen, Zhan, Wei
Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at execution time, which we call Residual-MPPI. It can customize a given prior policy for new performance metrics in few-shot and even zero-shot online settings. Moreover, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge of the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish few-shot and zero-shot online policy customization effectively, including customizing the champion-level racing agent Gran Turismo Sophy (GT Sophy) 1.0 in the challenging Gran Turismo Sport (GTS) racing environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi
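A minimal sketch of the kind of weighting such execution-time customization implies: each sampled action sequence is scored by the sum of the add-on reward plus a log-likelihood bonus under the prior policy, so the planner trades the new requirement off against staying close to the prior behavior. The array shapes, omega, and the temperature are illustrative assumptions, not the paper's exact algorithm:

    import numpy as np

    def residual_mppi_weights(added_rewards, prior_logps, omega=1.0, temperature=1.0):
        # added_rewards, prior_logps: (K, H) arrays for K sampled action sequences
        # over a planning horizon H. Returns normalized MPPI importance weights.
        scores = added_rewards.sum(axis=1) + omega * prior_logps.sum(axis=1)
        scores = (scores - scores.max()) / temperature   # shift for numerical stability
        weights = np.exp(scores)
        return weights / weights.sum()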
Active Prompting with Chain-of-Thought for Large Language Models
Diao, Shizhe, Wang, Pengcheng, Lin, Yong, Zhang, Tong
The increasing scale of large language models (LLMs) brings emergent abilities to various complex tasks requiring reasoning, such as arithmetic and commonsense reasoning. It is known that the effective design of task-specific prompts is critical for LLMs' ability to produce high-quality answers. In particular, an effective approach for complex question-and-answer tasks is example-based prompting with chain-of-thought (CoT) reasoning, which significantly improves the performance of LLMs. However, current CoT methods rely on a fixed set of human-annotated exemplars, which are not necessarily the most effective examples for different tasks. This paper proposes a new method, Active-Prompt, to adapt LLMs to different tasks with task-specific example prompts (annotated with human-designed CoT reasoning). For this purpose, we propose a solution to the key problem of determining which questions are the most important and helpful to annotate from a pool of task-specific queries. Borrowing ideas from the related problem of uncertainty-based active learning, we introduce several metrics to characterize the uncertainty and select the most uncertain questions for annotation. Experimental results demonstrate the superiority of our proposed method, achieving state-of-the-art results on eight complex reasoning tasks. Further analyses of different uncertainty metrics, pool sizes, zero-shot learning, and the accuracy-uncertainty relationship demonstrate the effectiveness of our method. Our code will be available at https://github.com/shizhediao/active-prompt.
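As a concrete illustration of the uncertainty-based selection step, the sketch below uses a disagreement metric (the fraction of distinct answers among the k completions sampled per question); the data layout and function names are assumptions, and the paper also considers other uncertainty metrics such as entropy:

    def disagreement(answers):
        # Uncertainty of one question: number of unique sampled answers over k.
        return len(set(answers)) / len(answers)

    def select_for_annotation(question_to_answers, n):
        # Rank questions by disagreement over their k sampled LLM answers and
        # return the n most uncertain ones for human chain-of-thought annotation.
        ranked = sorted(question_to_answers,
                        key=lambda q: disagreement(question_to_answers[q]),
                        reverse=True)
        return ranked[:n]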
Simulating Player Behavior for Data-Driven Interactive Narrative Personalization
Wang, Pengcheng (North Carolina State University) | Rowe, Jonathan (North Carolina State University) | Min, Wookhee (North Carolina State University) | Mott, Bradford (North Carolina State University) | Lester, James (North Carolina State University)
Data-driven approaches to interactive narrative personalization show significant promise for applications in entertainment, training, and education. A common limitation of data-driven interactive narrative planning methods is that they require an enormous amount of training data, which is rarely available and expensive to collect from observations of human players. An alternative approach to obtaining data is to generate synthetic data from simulated players. In this paper, we present a long short-term memory (LSTM) neural network framework for simulating players to train data-driven interactive narrative planners. By leveraging a small amount of previously collected human player interaction data, we devise a generative player simulation model. A multi-task neural network architecture is proposed to estimate player actions and experiential outcomes from a single model. Empirical results demonstrate that the bipartite LSTM network produces better-performing player action prediction models than several baseline techniques, and that the multi-task LSTM yields comparable player outcome prediction models with shorter training time. We also find that synthetic data from the player simulation model contributes to training more effective interactive narrative planners than raw human player data alone.
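A minimal PyTorch sketch of the single-model, multi-task idea described above: a shared LSTM encodes the interaction history and two heads predict the player's next action and the experiential outcomes. The dimensions and layer choices are assumptions, and this is not the paper's bipartite architecture:

    import torch
    import torch.nn as nn

    class MultiTaskPlayerLSTM(nn.Module):
        def __init__(self, obs_dim, n_actions, n_outcomes, hidden=128):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
            self.action_head = nn.Linear(hidden, n_actions)    # next player action
            self.outcome_head = nn.Linear(hidden, n_outcomes)  # experiential outcomes

        def forward(self, history):              # history: (batch, time, obs_dim)
            encoded, _ = self.lstm(history)
            last = encoded[:, -1]                # summary of the interaction so far
            return self.action_head(last), self.outcome_head(last)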