AITopics | proximal policy optimization

proximal policy optimization

DNA: Proximal Policy Optimization with a Dual Network Architecture

Neural Information Processing Systems | Mar-20-2026, 13:56:12 GMT

This paper explores the problem of simultaneously learning a value function and policy in deep actor-critic reinforcement learning models. We find that the common practice of learning these functions jointly is sub-optimal due to an order-of-magnitude difference in noise levels between the two tasks. Instead, we show that learning these tasks independently, but with a constrained distillation phase, significantly improves performance. Furthermore, we find that policy gradient noise levels decrease when using a lower-variance return estimate.

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Agile Interception of a Flying Target using Competitive Reinforcement Learning

Gavin, Timothée, Lacroix, Simon, Bronz, Murat

arXiv.org Machine Learning | Mar-18-2026

The interception of agile aerial targets using autonomous drones is a challenging and increasingly relevant problem in robotics and security. The increasing presence of unmanned aerial vehicles (UAVs) in unauthorized, restricted airspaces poses significant safety and security risks and has spurred interest in developing effective interception strategies [1]. In particular, scenarios such as airspace protection, infrastructure security, and event safety require the ability to capture or neutralize unauthorized drones with high precision and minimal collateral risk. Deploying interceptor drones equipped with nets is a promising approach, but it demands advanced control capabilities to match or exceed the agility of evasive targets. Traditional interception methods often rely on accurate models, preplanned strategies, or predictable target behaviour [2]. However, modern quadrotor drones can perform highly dynamic manoeuvres and will actively evade capture, rendering their trajectories unpredictable and challenging the effectiveness of classical methods [3].

artificial intelligence, machine learning, reinforcement learning, (11 more...)

arXiv.org Machine Learning

2603.16279

Country:

Europe > France > Occitanie > Haute-Garonne > Toulouse (0.05)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)

Genre: Research Report (0.42)

Industry: Information Technology > Robotics & Automation (0.34)

Technology:

Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

A Algorithm

Neural Information Processing Systems | Feb-13-2026, 19:16:23 GMT

This section consists of three parts, with each subsequent part building upon the previous one. Appendix A.1 covers the fundamentals of RL, where the actor-critic method is introduced. Appendix A.2 describes the RL algorithm for a single fulfillment agent, which is the proximal policy [...] Appendix A.3 presents the MARL algorithm for the [...] Currently, policy-based methods [Deisenroth et al., 2013] are prevalent because they are compatible with stochastic [...] To sum up, the complete procedure is given in Algorithm 1. Algorithm 1: Heterogeneous Multi-Agent Reinforcement Learning for Order Fulfillment. With regard to the advantage estimator, we set the GAE parameters [Schulman et al., 2016] [...] To highlight how our proposed benchmark differs from existing approaches focused on sub-tasks of order fulfillment, we compare the objectives, observations, and actions in Table 1. It should be noted that multiple formulations exist for each sub-task.
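
The excerpt above refers to the GAE advantage estimator, but the parameter values it sets are cut off in this listing. As a rough illustration only, the following sketch computes generalized advantage estimates for a single trajectory. The function name `gae_advantages` and the defaults `gamma=0.99`, `lam=0.95` are assumptions chosen for the example, not values taken from the paper, and the sketch assumes no episode boundary occurs inside the trajectory.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: list of T rewards.
    values:  list of T + 1 value estimates; the last entry is the
             bootstrap value of the state after the final step.
    gamma, lam: illustrative defaults (assumptions, not from the paper).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum after it.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
```

Setting `lam` closer to 0 leans on the value estimates (lower variance, more bias), while `lam` closer to 1 leans on the observed returns (higher variance, less bias).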

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization

Yuanxiang Gao, Li Chen, Baochun Li

Neural Information Processing Systems | Feb-13-2026, 18:36:25 GMT

Neural Information Processing Systems http://nips.cc/

cross-entropy minimization, placement, proximal policy optimization, (13 more...)

Neural Information Processing Systems

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Louisiana (0.04)
North America > Canada > Quebec > Montreal (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

e95475f5fb8edb9075bf9e25670d4013-Paper-Conference.pdf

Neural Information Processing Systems | Feb-12-2026, 14:42:06 GMT

learning, noise level, policy gradient, (12 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.69)

Industry: Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.99)

Gradient Informed Proximal Policy Optimization

Neural Information Processing Systems | Dec-24-2025, 02:48:42 GMT

We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an α-policy that stands as a locally superior policy. By adaptively modifying the α value, we can effectively manage the influence of analytical policy gradients during learning. To this end, we suggest metrics for assessing the variance and bias of analytical gradients, reducing dependence on these gradients when high variance or bias is detected. Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments. Our code can be found online: https://github.com/SonSang/gippo.
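
For context, the abstract above builds on PPO. The sketch below shows only the standard PPO clipped surrogate objective, not the α-policy or any other part of the method described above. The function name `ppo_clip_objective` and the default `clip_eps=0.2` are assumptions made for this illustration.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO (to be maximized).

    logp_new / logp_old: log-probabilities of the taken actions under
    the current policy and the data-collecting policy.
    clip_eps: illustrative clipping range (an assumption).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the new and old policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum gives a pessimistic bound on the improvement.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


if __name__ == "__main__":
    print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

With a positive advantage, the gain from raising an action's probability is capped once the ratio exceeds `1 + clip_eps`; with a negative advantage, the unclipped term is kept, so large harmful updates are still penalized.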

gradient informed proximal policy optimization, name change, proximal policy optimization, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.80)

Reward Scale Robustness for Proximal Policy Optimization via DreamerV3 Tricks

Neural Information Processing Systems | Dec-23-2025, 18:08:36 GMT

Most reinforcement learning methods rely heavily on dense, well-normalized environment rewards. DreamerV3 recently introduced a model-based method with a number of tricks that mitigate these limitations, achieving state-of-the-art on a wide range of benchmarks with a single set of hyperparameters. This result sparked discussion about the generality of the tricks, since they appear to be applicable to other reinforcement learning algorithms. Our work applies DreamerV3's tricks to PPO and is the first such empirical study outside of the original work. Surprisingly, we find that the tricks presented do not transfer as general improvements to PPO. We use a high quality PPO reference implementation and present extensive ablation studies totaling over 10,000 A100 hours on the Arcade Learning Environment and the DeepMind Control Suite. Though our experiments demonstrate that these tricks do not generally outperform PPO, we identify cases where they succeed and offer insight into the relationship between the implementation tricks. In particular, PPO with these tricks performs comparably to PPO on Atari games with reward clipping and significantly outperforms PPO without reward clipping.
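
To make the reward-scale issue concrete, here is a minimal sketch of two ways to tame reward magnitudes. The first is reward clipping, which the abstract mentions; the range `[-1, 1]` is a common Atari convention and an assumption here. The second is the symlog transform, which is one of DreamerV3's tricks but is not named in the abstract above, so its inclusion is also an assumption about what the study covers. Function names are illustrative.

```python
import math


def clip_reward(r, low=-1.0, high=1.0):
    """Clip a raw reward into [low, high]; magnitude information is lost."""
    return max(low, min(high, r))


def symlog(x):
    """Symmetric log squashing: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


if __name__ == "__main__":
    for r in (0.5, 10.0, 1000.0, -250.0):
        print(r, clip_reward(r), symlog(r))
```

Clipping maps every large reward to the same value, whereas symlog compresses large magnitudes while keeping their ordering and remains invertible through `symexp`.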

name change, proximal policy optimization, reward scale robustness, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Learning When to Ask: Simulation-Trained Humanoids for Mental-Health Diagnosis

Cenacchi, Filippo, Richards, Deborah, Cao, Longbing

arXiv.org Artificial Intelligence | Dec-11-2025

Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2512.08952

Country:

Europe > Middle East > Cyprus (0.16)
Oceania > Australia (0.14)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain

Verma, Navneet, Xie, Ying

arXiv.org Artificial Intelligence | Dec-9-2025

The increasing penetration of renewable energy sources in day-ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state-of-the-art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day-ahead energy markets. We introduce a comprehensive framework that employs a Reinforcement Learning (RL) agent for multi-objective energy optimization and blockchain for tamper-proof data and transaction management. Simulations using real-world data from the Electricity Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand-supply balancing within 2% of the demand and maintains near-optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand-based blockchain, ensuring transparency, auditability, and security, which are key enablers for trustworthy multi-agent energy trading. Our key contributions are a novel system architecture, the use of curriculum learning to train the RL agent, and policy insights that support real-world deployment.

artificial intelligence, machine learning, reinforcement learning, (12 more...)

arXiv.org Artificial Intelligence

2508.01888

Country: North America > United States > Texas (0.25)

Genre: Research Report (0.50)

Industry: