 deterministic mdp





Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs

Neural Information Processing Systems

In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.


Why and How Auxiliary Tasks Improve JEPA Representations

arXiv.org Artificial Intelligence

Joint-Embedding Predictive Architecture (JEPA) is increasingly used for visual representation learning and as a component in model-based RL, but its behavior remains poorly understood. We provide a theoretical characterization of a simple, practical JEPA variant that has an auxiliary regression head trained jointly with latent dynamics. We prove a No Unhealthy Representation Collapse theorem: in deterministic MDPs, if training drives both the latent-transition consistency loss and the auxiliary regression loss to zero, then any pair of non-equivalent observations, i.e., those that do not have the same transition dynamics or auxiliary value, must map to distinct latent representations. Thus, the auxiliary task anchors which distinctions the representation must preserve. Controlled ablations in a counting environment corroborate the theory and show that training the JEPA model jointly with the auxiliary head generates a richer representation than training them separately. Our work indicates a path to improve JEPA encoders: training them with an auxiliary function that, together with the transition dynamics, encodes the right equivalence relations.
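The role of the auxiliary head in the abstract above can be illustrated with a toy sketch. It is not the paper's architecture or its counting environment: the four-state cyclic counter, the scalar latents, and the names `losses`, `encode`, `step_latent`, `aux_head` and `aux_target` are all assumptions made for illustration. It shows that a collapsed encoder can drive the latent-transition consistency loss to zero yet cannot drive the auxiliary regression loss to zero, while an encoder that separates the states zeroes both.

```python
def losses(encode, step_latent, aux_head, transitions, aux_target):
    """Return (mean latent-transition loss, mean auxiliary regression loss)."""
    trans_loss = 0.0
    aux_loss = 0.0
    for obs, action, next_obs in transitions:
        z, z_next = encode(obs), encode(next_obs)
        # Latent-transition consistency: predicted next latent vs. encoded next obs.
        trans_loss += (step_latent(z, action) - z_next) ** 2
        # Auxiliary regression: the head must predict the target from the latent.
        aux_loss += (aux_head(z) - aux_target(obs)) ** 2
    n = len(transitions)
    return trans_loss / n, aux_loss / n


# Hypothetical deterministic MDP: a counter 0 -> 1 -> 2 -> 3 -> 0 with one action.
transitions = [(n, 1, (n + 1) % 4) for n in range(4)]


def count(obs):
    """Assumed auxiliary target: the current count."""
    return float(obs)


# Collapsed encoder: every observation maps to the same latent.
collapsed = losses(
    encode=lambda obs: 0.0,
    step_latent=lambda z, a: 0.0,
    aux_head=lambda z: 1.5,  # best constant prediction (mean of the targets)
    transitions=transitions,
    aux_target=count,
)

# Faithful encoder: distinct observations get distinct latents.
faithful = losses(
    encode=lambda obs: float(obs),
    step_latent=lambda z, a: (z + a) % 4,
    aux_head=lambda z: z,
    transitions=transitions,
    aux_target=count,
)

print(collapsed)  # transition loss is zero, auxiliary loss is not
print(faithful)   # both losses are zero
```

In this toy setting the transition loss alone cannot rule out the collapsed solution; only the auxiliary loss does, which is the intuition behind "the auxiliary task anchors which distinctions the representation must preserve".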



Efficient Computation of Blackwell Optimal Policies using Rational Functions

arXiv.org Artificial Intelligence

Markov Decision Problems (MDPs) provide a foundational framework for modelling sequential decision-making across diverse domains, guided by optimality criteria such as discounted and average rewards. However, these criteria have inherent limitations: discounted optimality may overly prioritise short-term rewards, while average optimality relies on strong structural assumptions. Blackwell optimality addresses these challenges, offering a robust and comprehensive criterion that ensures optimality under both discounted and average reward frameworks. Despite its theoretical appeal, existing algorithms for computing Blackwell Optimal (BO) policies are computationally expensive or hard to implement. In this paper we describe procedures for computing BO policies using an ordering of rational functions in the vicinity of 1. We adapt state-of-the-art algorithms for deterministic and general MDPs, replacing numerical evaluations with symbolic operations on rational functions to derive bounds independent of bit complexity. For deterministic MDPs, we give the first strongly polynomial-time algorithms for computing BO policies, and for general MDPs we obtain the first subexponential-time algorithm. We further generalise several policy iteration algorithms, extending the best known upper bounds from the discounted to the Blackwell criterion.
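A minimal sketch of what "an ordering of rational functions in the vicinity of 1" can mean, assuming discounted values are written as ratios of polynomials in the discount factor gamma and compared for gamma just below 1. This is not the paper's procedure: the helper names, the coefficient-list representation (lowest degree first), and the two example value functions are assumptions made for illustration. The sign of a polynomial just left of 1 is read off its first non-zero Taylor coefficient at 1, using exact arithmetic instead of numerical evaluation.

```python
from fractions import Fraction


def taylor_at_one(coeffs):
    """Rewrite sum(c_k * x**k) as sum(b_k * (x - 1)**k); return the b_k."""
    b = [Fraction(c) for c in coeffs]
    n = len(b) - 1
    for i in range(n):  # repeated synthetic division by (x - 1)
        for j in range(n - 1, i - 1, -1):
            b[j] += b[j + 1]
    return b


def sign_left_of_one(coeffs):
    """Sign of the polynomial for x slightly below 1 (0 if identically zero)."""
    for k, c in enumerate(taylor_at_one(coeffs)):
        if c != 0:
            s = 1 if c > 0 else -1
            # (x - 1)**k is negative just below 1 when k is odd.
            return s if k % 2 == 0 else -s
    return 0


def poly_mul(p, q):
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, c in enumerate(q):
            out[i + j] += Fraction(a) * Fraction(c)
    return out


def poly_sub(p, q):
    n = max(len(p), len(q))
    p = list(p) + [0] * (n - len(p))
    q = list(q) + [0] * (n - len(q))
    return [Fraction(a) - Fraction(c) for a, c in zip(p, q)]


def compare_near_one(f, g):
    """Sign of f - g just below 1; f and g are (numerator, denominator) pairs
    with denominators that are not identically zero."""
    (pf, qf), (pg, qg) = f, g
    num = poly_sub(poly_mul(pf, qg), poly_mul(pg, qf))
    den = poly_mul(qf, qg)
    return sign_left_of_one(num) * sign_left_of_one(den)


# Hypothetical policies: A earns 2 now and nothing afterwards, so V_A = 2;
# B earns 0 now and 1 per step afterwards, so V_B = gamma / (1 - gamma).
V_A = ([2], [1])
V_B = ([0, 1], [1, -1])

result = compare_near_one(V_A, V_B)
print(result)  # -1: B is preferred for all gamma close enough to 1
```

At gamma = 0.5 the same two policies rank the other way (V_A = 2, V_B = 1), which is the kind of short-term bias of discounted optimality that the abstract mentions.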



f5f3b8d720f34ebebceb7765e447268b-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all reviewers for their detailed and valuable comments, and will revise the paper accordingly as described below. We thank all reviewers for pointing those out, and will make corrections in the revision. We agree with the reviewer and will change the wording in the revision. HIRO paper, goal-conditioned HRL often yields better performance than HRL with Options. E.g. all graph-based works cited in the review obtain the subgoal sequence by solving a shortest-path In the revision, we will add these discussions to the related work section.


9 Appendix For all the following derivations, we use D

Neural Information Processing Systems

Based on Lemma 2, we can derive the upper bound of our original objective: Theorem 1 (Surrogate Objective as the Divergence Upper-bound). We provide proof with a counter-example. Based on Assumption 1, we have the following: Corollary 1. A similar strategy is adopted by [3]. Sec 3.2 to learn a value function v (s,s