ucbexplore
On a high level, both approaches build accurate estimates
We thank the reviewers for their comments and insightful reviews. ... 's is only logarithmic as the main dependency is w.r.t. ... VI algorithm for SSP was proved in [37] to converge in time quadratic w.r.t. the size of the considered state space. ... This allows tuning the parameter online according to the desired behavior. A sketch of the proof of Thm. 1 is currently available in App. B. In case of acceptance we will use ... We will include additional experiments for varying L in the final version.
Improved Sample Complexity for Incremental Autonomous Exploration in MDPs
Tarbouriech, Jean, Pirotta, Matteo, Valko, Michal, Lazaric, Alessandro
We investigate the exploration of an unknown environment when no reward function is provided. Building on the incremental exploration setting introduced by Lim and Auer [1], we define the objective of learning the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps (in expectation) from a reference state $s_0$. In this paper, we introduce a novel model-based approach that interleaves discovering new states from $s_0$ and improving the accuracy of a model estimate that is used to compute goal-conditioned policies to reach newly discovered states. The resulting algorithm, DisCo, achieves a sample complexity scaling as $\tilde{O}(L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2})$, where $A$ is the number of actions, $S_{L+\epsilon}$ is the number of states that are incrementally reachable from $s_0$ in $L+\epsilon$ steps, and $\Gamma_{L+\epsilon}$ is the branching factor of the dynamics over such states. This improves over the algorithm proposed in [1] in both $\epsilon$ and $L$ at the cost of an extra $\Gamma_{L+\epsilon}$ factor, which is small in most environments of interest. Furthermore, DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem defined on the $L$-reachable states with minimum cost $c_{\min}$. Finally, we report preliminary empirical results confirming our theoretical findings.
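As a rough illustration of how the stated bound behaves, the sketch below evaluates only the leading term $L^5 S_{L+\epsilon} \Gamma_{L+\epsilon} A \epsilon^{-2}$ from the abstract. It is not the authors' code: the function names and the numeric values are hypothetical, the constants and logarithmic factors hidden by $\tilde{O}$ are ignored, and $S_{L+\epsilon}$ and $\Gamma_{L+\epsilon}$ are held fixed even though in a real environment they may change with $L$ and $\epsilon$.

```python
def disco_bound_scaling(L, S, Gamma, A, eps):
    """Leading term L^5 * S * Gamma * A / eps^2 of the bound quoted in the abstract.

    Hypothetical helper: constants and log factors hidden by the O-tilde are dropped,
    so only ratios between two calls are meaningful, not the absolute values.
    """
    if L <= 0 or eps <= 0:
        raise ValueError("L and eps must be positive")
    return (L ** 5) * S * Gamma * A / (eps ** 2)


def scaling_ratio(base, changed):
    """Ratio of the leading term under `changed` parameters to that under `base`."""
    return disco_bound_scaling(**changed) / disco_bound_scaling(**base)


# Hypothetical parameter values, chosen only for illustration.
base = dict(L=10, S=50, Gamma=4, A=5, eps=0.1)
half_eps = dict(base, eps=base["eps"] / 2)   # halve the accuracy parameter
double_L = dict(base, L=base["L"] * 2)       # double the horizon

print("halving eps multiplies the term by about", scaling_ratio(base, half_eps))
print("doubling L multiplies the term by about", scaling_ratio(base, double_L))
```

Under these assumptions, halving $\epsilon$ multiplies the leading term by $2^2 = 4$ and doubling $L$ multiplies it by $2^5 = 32$, which is all the quadratic dependence on $1/\epsilon$ and the fifth-power dependence on $L$ amount to.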
Autonomous exploration for navigating in non-stationary CMPs
Gajane, Pratik, Ortner, Ronald, Auer, Peter, Szepesvari, Csaba
We consider a setting in which the objective is to learn to navigate in a controlled Markov process (CMP) where transition probabilities may abruptly change. For this setting, we propose a performance measure called exploration steps which counts the time steps at which the learner lacks sufficient knowledge to navigate its environment efficiently. We devise a learning meta-algorithm, MNM, and prove an upper bound on the exploration steps in terms of the number of changes.