AITopics | Reinforcement Learning

Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions

Neural Information Processing SystemsApr-25-2026, 05:10:25 GMT

Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the groundtruth. We provide exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the datacoverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.48)

Add feedback

Offline RLWithout Off-Policy Evaluation

Neural Information Processing SystemsApr-25-2026, 04:38:44 GMT

Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This onestep algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)

Add feedback

286a371d8a0a559281f682f8fbf89834-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 04:38:16 GMT

artificial intelligence, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Country: Asia > China (0.15)

Genre: Research Report (0.93)

Industry:

Transportation > Ground > Road (0.96)
Information Technology (0.71)
Energy (0.68)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Sensing and Signal Processing (0.93)
(3 more...)

Add feedback

Medical Dead-ends and Learning to Identify High-risk States and Treatments

Neural Information Processing SystemsApr-25-2026, 04:20:27 GMT

Machine learning has successfully framed many sequential decision making problems as either supervised prediction, or optimal decision-making policy identification via reinforcement learning. In data-constrained offline settings, both approaches may fail as they assume fully optimal behavior or rely on exploring alternatives that may not exist. We introduce an inherently different approach that identifies possible "dead-ends" of a state space. We focus on the condition of patients in the intensive care unit, where a "medical dead-end" indicates that a patient will expire, regardless of all potential future treatment sequences. We postulate "treatment security" as avoiding treatments with probability proportional to their chance of leading to dead-ends, present a formal proof, and frame discovery as an RL problem. We then train three independent deep neural models for automated state construction, dead-end discovery and confirmation. Our empirical results discover that dead-ends exist in real clinical data among septic patients, and further reveal gaps between secure treatments and those that were administered.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario (0.28)

Genre:

Research Report > Experimental Study (0.66)
Research Report > New Finding (0.46)

Industry:

Health & Medicine > Health Care Providers & Services (0.66)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.30)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)

Add feedback

Instance-Dependent Near-Optimal Policy Identification in Linear MDPs via Online Experiment Design

Neural Information Processing SystemsApr-25-2026, 04:01:30 GMT

While much progress has been made in understanding the minimax sample complexity of reinforcement learning (RL)--the complexity of learning on the "worstcase" instance--such measures of complexity often do not capture the true difficulty of learning. In practice, on an "easy" instance, we might hope to achieve a complexity far better than that achievable on the worst-case instance. In this work we seek to understand the "instance-dependent" complexity of learning near-optimal policies (PACRL) in the setting of RL with linear function approximation. We propose an algorithm, PEDEL, which achieves a fine-grained instance-dependent measure of complexity, the first of its kind in the RL with function approximation setting, thereby capturing the difficulty of learning on each particular problem instance. Through an explicit example, we show that PEDEL yields provable gains over low-regret, minimax-optimal algorithms and that such algorithms are unable to hit the instance-optimal rate. Our approach relies on a novel online experiment design-based procedure which focuses the exploration budget on the "directions" most relevant to learning a near-optimal policy, and may be of independent interest.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)

Add feedback

26d01e5ed42d8dcedd6aa0e3e99cffc4-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 03:49:00 GMT

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Add feedback

Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards

Neural Information Processing SystemsApr-25-2026, 03:48:35 GMT

We study the problem of reward shaping to accelerate the training process of a reinforcement learning agent. Existing works have considered a number of different reward shaping formulations; however, they either require external domain knowledge or fail in environments with extremely sparse rewards. In this paper, we propose a novel framework, Exploration-Guided Reward Shaping (EXPLORS), that operates in a fully self-supervised manner and can accelerate an agent's learning even in sparse-reward environments. The key idea of EXPLORS is to learn an intrinsic reward function in combination with exploration-based bonuses to maximize the agent's utility w.r.t.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: Europe (0.46)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Exploration-Guided Reward Shaping for Reinforcement Learning under Sparse Rewards

Neural Information Processing SystemsApr-25-2026, 03:48:32 GMT

We study the problem of reward shaping to accelerate the training process of a reinforcement learning agent. Existing works have considered a number of different reward shaping formulations; however, they either require external domain knowledge or fail in environments with extremely sparse rewards. In this paper, we propose a novel framework, Exploration-Guided Reward Shaping (EXPLORS), that operates in a fully self-supervised manner and can accelerate an agent's learning even in sparse-reward environments. The key idea of EXPLORS is to learn an intrinsic reward function in combination with exploration-based bonuses to maximize the agent's utility w.r.t.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country: Europe (0.46)

Genre: Research Report > New Finding (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Flexible Option Learning

Neural Information Processing SystemsApr-25-2026, 03:48:04 GMT

Temporal abstraction in reinforcement learning (RL), offers the promise of improving generalization and knowledge transfer in complex environments, by propagating information more efficiently over time. Although option learning was initially formulated in a way that allows updating many options simultaneously, using off-policy, intra-option learning (Sutton, Precup & Singh, 1999), many of the recent hierarchical reinforcement learning approaches only update a single option at a time: the option currently executing. We revisit and extend intra-option learning in the context of deep reinforcement learning, in order to enable updating all options consistent with current primitive action choices, without introducing any additional estimates. Our method can therefore be naturally adopted in most hierarchical RL frameworks. When we combine our approach with the option-critic algorithm for option discovery, we obtain significant improvements in performance and data-efficiency across a wide variety of domains.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

Neural Information Processing Systems

Country: North America > United States > New York (0.28)

Genre: Research Report (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback