AITopics | expert policy

Inverse Q-Learning Done Right: Offline Imitation Learning in Qπ-Realizable MDPs

Neural Information Processing SystemsJun-23-2026, 00:54:35 GMT

We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear Qπ-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (SPOIL), which is guaranteed to match the performance of any expert up to an additive error ε with access to O(ε 2) samples. Moreover, we extend this result to possibly nonlinear Qπ-realizable MDPs at the cost of a worse sample complexity of order O(ε 4). Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

d8f9cd02831dac9717c1c00329a49cd0-Paper-Conference.pdf

Neural Information Processing SystemsJun-22-2026, 22:27:39 GMT

A person jogs wiclusterth rig.hFinallyt head, as lifshot.

artificial intelligence, justification, machine learning, (17 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Robots (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

On Feasible Rewards in Multi-agent Inverse Reinforcement Learning

Neural Information Processing SystemsJun-21-2026, 07:33:02 GMT

Multi-agent Inverse Reinforcement Learning (MAIRL) aims to recover agent reward functions from expert demonstrations. We characterize the feasible reward set in Markov games, identifying all reward functions that rationalize a given equilibrium. However, equilibrium-based observations are often ambiguous: a single Nash equilibrium can correspond to many reward structures, potentially changing the game's nature in multi-agent systems. We address this by introducing entropyregularized Markov games, which yield a unique equilibrium while preserving strategic incentives. For this setting, we provide a sample complexity analysis detailing how errors affect learned policy performance. Our work establishes theoretical foundations and practical insights for MAIRL.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.93)

Genre: Research Report > Experimental Study (1.00)

Industry: Leisure & Entertainment > Games (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Minimax Optimal Online Imitation Learning via Replay Estimation

Neural Information Processing SystemsApr-25-2026, 07:43:36 GMT

Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with H2/Nexp for behavioral cloning and H/ p Nexp for online moment matching, where H is the horizon and Nexp is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of parametric function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e.

artificial intelligence, machine learning, nexp, (17 more...)

Neural Information Processing Systems

Country: North America > United States (0.68)

Genre: Instructional Material > Online (0.61)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Fuzzy Logic (0.35)

Add feedback

26d01e5ed42d8dcedd6aa0e3e99cffc4-Paper-Conference.pdf

Neural Information Processing SystemsApr-25-2026, 03:49:00 GMT

artificial intelligence, machine learning, reinforcement learning, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Robots (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)

Add feedback

Learning Shared Safety Constraints from Multi-task Demonstrations

Neural Information Processing SystemsApr-25-2026, 02:05:38 GMT

Regardless of the particular task we want them to perform in an environment, there are often shared safety constraints we want our agents to respect. For example, regardless of whether it is making a sandwich or clearing the table, a kitchen robot should not break a plate. Manually specifying such a constraint can be both time-consuming and error-prone. We show how to learn constraints from expert demonstrations of safe task completion by extending inverse reinforcement learning (IRL) techniques to the space of constraints. Intuitively, we learn constraints that forbid highly rewarding behavior that the expert could have taken but chose not to. Unfortunately, the constraint learning problem is rather ill-posed and typically leads to overly conservative constraints that forbid all behavior that the expert did not take. We counter this by leveraging diverse demonstrations that naturally occur in multi-task settings to learn a tighter set of constraints.

constraint, machine learning, reinforcement learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.28)

Technology: