AITopics | cppo

cppo

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Neural Information Processing Systems, Jun-23-2026, 07:45:01 GMT

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO).

completion, large language model, machine learning, (19 more...)

Country: Asia > China (0.46)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Appendix A Continuous RL: Formulation and Well-Posedness; A.1 Exploratory Stochastic Control

Neural Information Processing Systems, Feb-9-2026, 12:06:52 GMT

Assumption 2. The following conditions are assumed throughout: A; (32) (iv) r has polynomial growth in x and a, i.e., there exists a constant C > 0 and µ ≥ 1 such that To do so, let's assume Theorem 6. Assume that for a policy π and for every x, Assumption 3. Assume the following conditions hold: Lemma 9. Let π, π̂ be two feedback policies. We need a lemma for the perturbation bounds. Here we present a detailed version of the CPPO algorithm. D.3 below, which clearly illustrates the advantage of square-root KL divergence.

artificial intelligence, kl-divergence, machine learning, (17 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

2c53bc01e30711a08f6ac86919193022-Supplemental-Conference.pdf

Neural Information Processing Systems, Oct-8-2025, 08:54:35 GMT

cppo, equation, kl-divergence, (16 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Policy Optimization for Continuous Reinforcement Learning

Neural Information Processing Systems, Oct-8-2025, 08:54:31 GMT

Through numerical experiments, we demonstrate the effectiveness and advantages of our approach.

algorithm, continuous rl, kl-divergence, (14 more...)

Country: Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Lin, Zhihang, Lin, Mingbao, Xie, Yuan, Ji, Rongrong

arXiv.org Artificial Intelligence, Mar-28-2025

This paper introduces Completion Pruning Policy Optimization (CPPO) to accelerate the training of reasoning models based on Group Relative Policy Optimization (GRPO). GRPO, while effective, incurs high training costs because it must sample multiple completions for each question. Our experiments and theoretical analysis reveal that the number of completions impacts model accuracy yet increases training time multiplicatively, and that not all completions contribute equally to policy training: their contribution depends on their relative advantage. To address these issues, we propose CPPO, which prunes completions with low absolute advantages, significantly reducing the number needed for gradient calculation and updates. Additionally, we introduce a dynamic completion allocation strategy to maximize GPU utilization by incorporating additional questions, further enhancing training efficiency. Experimental results demonstrate that CPPO achieves up to $8.32\times$ speedup on GSM8K and $3.51\times$ on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
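The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep` parameter, the use of the population standard deviation, and the toy rewards are all assumptions made for illustration. It computes group-relative advantages for one question's completions and keeps only those with the largest absolute advantage.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (one question's completions).

    Assumption: advantage = (reward - group mean) / (group std + eps),
    using the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def prune_completions(rewards, keep):
    """Return indices and advantages of the `keep` completions with the
    largest absolute advantage (hypothetical helper, not from the paper)."""
    adv = group_relative_advantages(rewards)
    # Rank completion indices by |advantage|, largest first.
    ranked = sorted(range(len(adv)), key=lambda i: abs(adv[i]), reverse=True)
    kept = sorted(ranked[:keep])  # restore original ordering of survivors
    return kept, [adv[i] for i in kept]

# Toy rewards for 8 sampled completions of a single question.
rewards = [1.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]
kept, kept_adv = prune_completions(rewards, keep=2)
print(kept)
```

In this toy group, only the two completions whose rewards deviate most from the group mean survive, so a gradient step would process 2 completions instead of 8.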

completion, large language model, machine learning, (17 more...)

2503.22342

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > Singapore (0.04)
Asia > China > Fujian Province > Xiamen (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

Ying, Chengyang, Zhou, Xinning, Su, Hang, Yan, Dong, Chen, Ning, Zhu, Jun

arXiv.org Artificial Intelligence, Sep-17-2022

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
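To make the risk measure concrete, here is a minimal sketch of an empirical CVaR estimate checked against a threshold. It is not the paper's algorithm: the loss convention (loss = negative return), the tail level `alpha`, the threshold, and the sample returns are all assumptions chosen for illustration.

```python
import math

def empirical_cvar(losses, alpha):
    """Empirical CVaR at tail level alpha: the mean of the worst
    ceil(alpha * n) losses (larger loss = worse outcome)."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k

# Hypothetical episode returns; two episodes end in failure.
returns = [10.0, 12.0, 11.0, -5.0, 9.0, 13.0, 10.0, -20.0, 11.0, 12.0]
losses = [-r for r in returns]  # assumption: loss is the negative return

cvar = empirical_cvar(losses, alpha=0.2)
threshold = 10.0                # hypothetical risk budget
violates = cvar > threshold     # a constrained method would penalize this
print(cvar, violates)
```

Unlike the worst-case return, which here would look only at the single episode with return -20, this estimate averages over the worst 20% of episodes, which is one reason a CVaR constraint can be less pessimistic.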

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2206.04436

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
