AITopics | adversarial mdp

State-free Reinforcement Learning

Neural Information Processing Systems | Feb-18-2026, 07:20:45 GMT

Reinforcement learning (RL) studies the problem where an agent interacts with an unknown environment to optimize cumulative rewards/losses [Sutton and Barto, 2018].

artificial intelligence, machine learning, reinforcement learning, (14 more...)

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Feb-12-2026, 05:29:41 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Nevada (0.04)
North America > Canada > Quebec > Montreal (0.04)
(2 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.34)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Dec-25-2025, 10:50:31 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
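The delayed-feedback protocol described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it only builds the arrival schedule implied by "feedback for episode $k$ is revealed at the end of episode $k + d^k$", computes the total delay $D$, and compares the two rates $\sqrt{K + D}$ and $(K + D)^{2/3}$ with all constants and logarithmic factors ignored. The function names and the example delay sequence are hypothetical.

```python
import math
from collections import defaultdict


def delayed_feedback_schedule(delays):
    """Return {episode: [episodes whose feedback arrives at its end]}.

    delays[k - 1] is d^k for episode k (episodes are 1-indexed).
    Feedback for episode k is revealed at the end of episode k + d^k.
    Feedback that would arrive after the last episode K is dropped here,
    which is a simplification made for this illustration.
    """
    K = len(delays)
    arrivals = defaultdict(list)
    for k, d in enumerate(delays, start=1):
        arrivals[k + d].append(k)
    return {t: arrivals[t] for t in range(1, K + 1)}


def total_delay(delays):
    """D = sum over k of d^k."""
    return sum(delays)


def bound_rates(K, D):
    """Rates only (no constants or log factors): sqrt(K + D) and (K + D)^(2/3)."""
    return math.sqrt(K + D), (K + D) ** (2 / 3)


# Hypothetical delay sequence d^1, ..., d^5 (assumed for illustration).
delays = [0, 2, 1, 0, 3]
schedule = delayed_feedback_schedule(delays)
# schedule == {1: [1], 2: [], 3: [], 4: [2, 3, 4], 5: []}

D = total_delay(delays)  # 6
sqrt_rate, two_thirds_rate = bound_rates(len(delays), D)
print(schedule, D, sqrt_rate, two_thirds_rate)
```

In this toy run the feedback for episodes 2, 3, and 4 all arrives at the end of episode 4, so the learner acts for two episodes without any new information. For any $K + D > 1$ the square-root rate is the smaller of the two.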

adversarial mdp, name change, near-optimal regret, (3 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Neural Information Processing Systems | Dec-24-2025, 20:52:22 GMT

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art.

adversarial mdp, improved exploration, policy optimization, (11 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.59)

State-free Reinforcement Learning

Neural Information Processing Systems | Oct-10-2025, 17:53:20 GMT

Reinforcement learning (RL) studies the problem where an agent interacts with an unknown environment to optimize cumulative rewards/losses [Sutton and Barto, 2018].

algorithm, neural information processing system, suffice, (13 more...)

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Aug-19-2025, 08:00:09 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Country:

North America > United States > California (0.14)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Asia > Middle East > Jordan (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.45)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.45)

d850b7e0cdc7f1c0820c6ad85405ae94-Paper-Conference.pdf

Neural Information Processing Systems | Aug-19-2025, 08:00:05 GMT

artificial intelligence, machine learning, reinforcement learning, (14 more...)

Country:

North America > United States > California (0.14)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.05)
Asia > Middle East > Jordan (0.04)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Robots (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.46)

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback

Neural Information Processing Systems | Jan-19-2025, 01:04:21 GMT

The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed in delay. This paper studies online learning in episodic Markov decision process (MDP) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only in the end of episode $k + d^k$, where the delay $d^k$ can be changing over episodes and chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.

adversarial mdp, delayed bandit feedback, near-optimal regret, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.30)

Policy Optimization in Adversarial MDPs: Improved Exploration via Dilated Bonuses

Neural Information Processing Systems | Jan-19-2025, 00:33:37 GMT

Policy optimization is a widely-used method in reinforcement learning. Due to its local-search nature, however, theoretical guarantees on global optimality often rely on extra assumptions on the Markov Decision Processes (MDPs) that bypass the challenge of global exploration. To eliminate the need of such assumptions, in this work, we develop a general solution that adds dilated bonuses to the policy update to facilitate global exploration. To showcase the power and generality of this technique, we apply it to several episodic MDP settings with adversarial losses and bandit feedback, improving and generalizing the state-of-the-art. When the number of states is infinite, under the assumption that the state-action values are linear in some low-dimensional features, we obtain $\widetilde{\mathcal{O}}(T^{2/3})$ regret with the help of a simulator, matching the result of Neu and Olkhovskaya [2020] while importantly removing the need of an exploratory policy that their algorithm requires.

assumption, improved exploration, policy optimization, (9 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.61)

State-free Reinforcement Learning

Chen, Mingyu, Pacchiano, Aldo, Zhang, Xuezhou

arXiv.org Artificial Intelligence | Sep-27-2024

In this work, we study the \textit{state-free RL} problem, where the algorithm has no information about the states before interacting with the environment. Specifically, denoting the reachable state set by $S^\Pi := \{ s \mid \max_{\pi\in \Pi} q^{P, \pi}(s) > 0 \}$, we design an algorithm that requires no information on the state space $S$ while having a regret that is completely independent of $S$ and depends only on $S^\Pi$. We view this as a concrete first step towards \textit{parameter-free RL}, with the goal of designing RL algorithms that require no hyper-parameter tuning.
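To make the reachable state set $S^\Pi$ concrete, here is a minimal Python sketch under strong simplifying assumptions that are not taken from the paper: the MDP is tabular, the transition function $P$ is known, and $\Pi$ contains all policies. Under those assumptions, a state has positive occupancy under some policy exactly when it can be reached from an initial state through transitions of positive probability, so a breadth-first search over the support of $P$ recovers the set. The function name, data layout, and toy MDP are hypothetical; the state-free algorithm itself works without knowing $P$ or $S$ and is not shown here.

```python
from collections import deque


def reachable_states(P, initial_states):
    """Breadth-first search over the support of the transition function.

    P[s][a] is a dict mapping next_state -> probability (assumed layout).
    Returns every state reachable with positive probability from
    initial_states under at least one sequence of actions.
    """
    seen = set(initial_states)
    queue = deque(initial_states)
    while queue:
        s = queue.popleft()
        for next_dist in P.get(s, {}).values():
            for s_next, prob in next_dist.items():
                if prob > 0 and s_next not in seen:
                    seen.add(s_next)
                    queue.append(s_next)
    return seen


# Hypothetical toy MDP: state "s3" exists in S but can never be reached from "s0".
P = {
    "s0": {"a": {"s1": 1.0}, "b": {"s0": 0.5, "s2": 0.5}},
    "s1": {"a": {"s1": 1.0}},
    "s2": {"a": {"s2": 1.0}},
    "s3": {"a": {"s0": 1.0}},
}
reachable = reachable_states(P, ["s0"])
print(sorted(reachable))
```

Here $S$ has four states but the reachable set has only three, which is the gap a regret bound depending only on $S^\Pi$ is meant to exploit.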

algorithm, neural information processing system, suffice, (13 more...)

2409.18439

Country: