AITopics | exploration policy

Deployment Efficient Reward-Free Exploration with Linear Function Approximation

Neural Information Processing SystemsJun-15-2026, 16:39:21 GMT

We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of distinct policies implemented during learning. By "deployment efficient", we mean algorithms that require few policies deployed during exploration - crucial in real-world applications where such deployments are costly or disruptive. We design a novel reinforcement learning algorithm that achieves near-optimal deployment efficiency for linear MDPs in the reward-free setting, using at most H exploration policies during execution (where H is the horizon length), while maintaining sample complexity polynomial in feature dimension and horizon length. Unlike previous approaches with similar deployment efficiency guarantees, our algorithm's sample complexity is independent of the reachability or explorability coefficients of the underlying MDP, which can be arbitrarily small and lead to unbounded sample complexity in certain cases - directly addressing an open problem from prior work. Our technical contributions include a data-dependent method for truncating stateaction pairs in linear MDPs, efficient offline policy evaluation and optimization algorithms for these truncated MDPs, and a careful integration of these components to implement reward-free exploration with linear function approximation without sacrificing deployment efficiency.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Add feedback

fd489a44f3bcb9f122e4931ef21d0c43-Paper-Conference.pdf

Neural Information Processing SystemsMay-1-2026, 05:15:49 GMT

machine learning, reinforcement learning, task representation, (12 more...)

Neural Information Processing Systems

Country: Asia > China (0.47)

Genre: Research Report (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

26d01e5ed42d8dcedd6aa0e3e99cffc4-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 03:49:00 GMT

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Add feedback

Sustainable Online Reinforcement Learning for Auto-bidding

Neural Information Processing SystemsApr-24-2026, 16:10:19 GMT

Recently, auto-bidding technique has become an essential tool to increase the revenue of advertisers. Facing the complex and ever-changing bidding environments in the real-world advertising system (RAS), state-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate realtime bids on behalf of the advertisers. Due to safety concerns, it was believed that the RL training process can only be carried out in an offline virtual advertising system (VAS) that is built based on the historical data generated in the RAS. In this paper, we argue that there exists significant gaps between the VAS and RAS, making the RL training process suffer from the problem of inconsistency between online and offline (IBOO). Firstly, we formally define the IBOO and systematically analyze its causes and influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL) framework that trains the auto-bidding policy by directly interacting with the RAS, instead of learning in the VAS. Specifically, based on our proof of the Lipschitz smooth property of the Q function, we design a safe and efficient online exploration (SER) policy for continuously collecting data from the RAS. Meanwhile, we derive the theoretical lower bound on the safety degree of the SER policy. We also develop a variance-suppressed conservative Q-learning (V-CQL) method to effectively and stably learn the auto-bidding policy with the collected data.

auto-bidding policy, machine learning, reinforcement learning, (18 more...)

Neural Information Processing Systems

Genre:

Research Report (0.46)
Instructional Material > Online (0.41)

Industry: Information Technology > Services (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Context Shift Reduction for Offline Meta-Reinforcement Learning Y unkai Gao

Neural Information Processing SystemsFeb-18-2026, 03:21:14 GMT

Offline meta-reinforcement learning (OMRL) utilizes pre-collected offline datasets to enhance the agent's generalization ability on unseen tasks.

machine learning, reinforcement learning, task representation, (12 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Anhui Province > Hefei (0.04)

Genre: Research Report (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL Andrew Wagenmaker

Neural Information Processing SystemsFeb-16-2026, 14:41:25 GMT

Such direct sim2real transfer is not guaranteed to succeed, however, and in cases where it fails, it is unclear how to best utilize the simulator. In this work, we show that in many regimes, while direct sim2real transfer may fail, we can utilize the simulator to learn a set of exploratory policies which enable efficient exploration in the real world.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: