Defining and Characterizing Reward Gaming
We provide the first formal definition of reward hacking, a phenomenon where optimizing an imperfect proxy reward function, $\tilde{\mathcal{R}}$, leads to poor performance according to the true reward function, $\mathcal{R}$. We say that a proxy is unhackable if increasing the expected proxy return can never decrease the expected true return. Intuitively, it might be possible to create an unhackable proxy by leaving some terms out of the reward function (making it "narrower") or by overlooking fine-grained distinctions between roughly equivalent outcomes, but we show this is usually not the case. A key insight is that the linearity of reward (in state-action visit counts) makes unhackability a very strong condition: over the set of all stochastic policies, two reward functions can be unhackable with respect to each other only if one of them is constant. We thus turn our attention to deterministic policies and finite sets of stochastic policies, where non-trivial unhackable pairs always exist, and establish necessary and sufficient conditions for the existence of simplifications, an important special case of unhackability. Our results reveal a tension between using reward functions to specify narrow tasks and aligning AI systems with human values.
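To make the definition concrete, here is one way to write it formally (a sketch in standard discounted-return notation, assuming a policy set $\Pi$ and discount factor $\gamma$; not necessarily the paper's exact statement):

```latex
% Expected (discounted) return of policy \pi under reward function R
J_R(\pi) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t\, R(s_t, a_t)\Big]

% The proxy \tilde{R} is unhackable relative to the true reward R on a
% policy set \Pi if increasing proxy return never decreases true return:
\forall\, \pi, \pi' \in \Pi:\quad
J_{\tilde{R}}(\pi) > J_{\tilde{R}}(\pi') \implies J_R(\pi) \ge J_R(\pi')
```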
Marginal Utility for Planning in Continuous or Large Discrete Action Spaces
Sample-based planning is a powerful family of algorithms for generating intelligent behavior from a model of the environment. Generating good candidate actions is critical to the success of sample-based planners, particularly in continuous or large action spaces. Typically, candidate action generation exhaustively enumerates the action space, relies on domain knowledge, or, more recently, learns a stochastic policy to guide the search. In this paper, we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. The marginal utility of an action generator measures the increase in value an action provides over previously generated actions.
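As a rough illustration of the objective (a sketch, not the paper's implementation; the value estimates are assumed to come from the planner's search), the marginal utility of each candidate can be computed as its improvement over the best previously generated candidate:

```python
import numpy as np

def marginal_utility(q_values: np.ndarray) -> np.ndarray:
    """Marginal utility of each candidate action, in generation order.

    q_values[i] is the estimated value Q(s, a_i) of the i-th candidate.
    The first candidate's marginal utility is its value; each later
    candidate earns only the improvement it adds over the best so far.
    """
    best_so_far = np.maximum.accumulate(q_values)
    mu = np.empty_like(q_values)
    mu[0] = q_values[0]
    mu[1:] = np.maximum(q_values[1:] - best_so_far[:-1], 0.0)
    return mu

# Example: the third candidate adds nothing because a better one came first.
print(marginal_utility(np.array([1.0, 3.0, 2.0, 4.0])))  # [1. 2. 0. 1.]
```

Under this objective, a generator is rewarded only for candidates that improve on what it has already proposed, which pushes it toward diverse, high-value proposals rather than repeatedly sampling the same good action.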
Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition
Wang, Yan, Deng, Ke, Ren, Yongli
Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others' learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.
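A minimal sketch of the cross-entropy-style policy update described above (the interfaces and names here are assumptions; MCEM's monotonic nonlinear critic decomposition and its off-policy corrections via the modified k-step return and Retrace are not shown):

```python
import numpy as np

def cem_policy_update(sample_joint_actions, score_joint_action, fit_policy,
                      n_samples=64, elite_frac=0.1):
    """One cross-entropy-method step for a (joint) policy.

    sample_joint_actions(n) -> array of n candidate joint actions
    score_joint_action(a)   -> scalar critic value for joint action a
    fit_policy(elites)      -> refit the policy to maximize the
                               likelihood of the elite joint actions
    """
    actions = sample_joint_actions(n_samples)
    scores = np.array([score_joint_action(a) for a in actions])
    n_elite = max(1, int(elite_frac * n_samples))
    elite_idx = np.argsort(scores)[-n_elite:]  # highest-value joint actions
    fit_policy(actions[elite_idx])             # raise their probability
```

Because the update only increases the probability of high-value joint actions, low-value behavior from any single agent is never propagated through centralized gradients, which is the mechanism the abstract credits for avoiding CDM.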
- Research Report > Promising Solution (0.48)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)
Conformal Constrained Policy Optimization for Cost-Effective LLM Agents
Si, Wenwen, Jang, Sooyong, Lee, Insup, Bastani, Osbert
While large language models (LLMs) have recently made tremendous progress towards solving challenging AI problems, they have done so at increasingly steep computational and API costs. We propose a novel strategy that combines multiple LLMs with varying cost/accuracy tradeoffs in an agentic manner: models and tools are run in sequence, as determined by an orchestration model, to minimize cost subject to a user-specified level of reliability; this constraint is formalized using conformal prediction to provide guarantees. To solve this problem, we propose Conformal Constrained Policy Optimization (CCPO), a training paradigm that integrates constrained policy optimization with off-policy reinforcement learning and recent advances in online conformal prediction. CCPO jointly optimizes a cost-aware policy (score function) and an adaptive threshold. Across two multi-hop question answering benchmarks, CCPO achieves up to a 30% cost reduction compared to other cost-aware baselines and LLM-guided methods, without compromising reliability. Our approach provides a principled and practical framework for deploying LLM agents that are significantly more cost-effective while maintaining reliability.
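For intuition, the adaptive-threshold component can be sketched as an online conformal update of the following form (assumed notation; CCPO's joint optimization of the score function and threshold is more involved):

```python
def update_threshold(tau: float, err: int, alpha: float, lr: float = 0.05) -> float:
    """One step of an online conformal threshold update.

    tau   : current acceptance threshold on the score function
    err   : 1 if the last accepted answer was wrong, else 0
    alpha : target error rate (user-specified unreliability budget)

    When errors occur more often than alpha the threshold tightens;
    when they are rarer it loosens, so the long-run error rate
    tracks the target alpha.
    """
    return tau + lr * (err - alpha)
```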
Distributionally Robust Imitation Learning
We consider imitation learning in a Markov Decision Process (MDP) setting where the reward function is not given, but demonstrations from experts are available. Although the goal of imitation learning is to learn a policy that produces behaviors nearly as good as the experts' for a desired task, assumptions of consistent optimality for demonstrated behaviors are often violated in practice.
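For context, a distributionally robust objective generically takes the following min-max shape (standard DRO notation; the paper's exact formulation for imitation learning may differ):

```latex
% Optimize against the worst-case distribution Q in an uncertainty set
% around the (empirical) demonstration distribution P, e.g. a ball of
% radius \rho under a divergence D:
\min_{\theta} \; \max_{Q \,:\, D(Q \,\|\, P) \le \rho} \;
\mathbb{E}_{Q}\big[\ell(\theta; s, a)\big]
```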
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.72)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)
Reinforcing Diffusion Models by Direct Group Preference Optimization
Luo, Yihong, Hu, Tianyang, Tang, Jing
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
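As a sketch of the group-relative signal such methods build on (this shows generic group-normalized scoring, not DGPO's exact preference loss):

```python
import numpy as np

def group_relative_scores(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within a group of samples drawn for the same prompt.

    Each sample is scored by how it compares to its group: above-average
    samples get positive weight, below-average ones negative, so the
    learning signal uses only relative information within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four samples generated for one prompt, scored by a reward model.
print(group_relative_scores(np.array([0.2, 0.5, 0.9, 0.4])))
```

Because the signal depends only on within-group comparisons, it can be attached to samples from a deterministic ODE sampler directly, with no stochastic policy needed to define a likelihood.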
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)