Best Policy Identification in Linear MDPs

Taupin, Jerome, Jedra, Yassir, Proutiere, Alexandre

Aug-11-2022, arXiv.org Artificial Intelligence

We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}({\frac{d}{(\varepsilon+\Delta)^2}} (\log(\frac{1}{\delta})+d))$ where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.
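To get a feel for how the stated upper bound scales, the following minimal sketch evaluates the leading term $\frac{d}{(\varepsilon+\Delta)^2}(\log(\frac{1}{\delta})+d)$ from the abstract. The function name `sample_complexity_scaling` and its parameter names are hypothetical, and the hidden constant is assumed to be 1; the ${\cal O}(\cdot)$ notation hides constants and any factors the abstract does not state, so the output is a relative scale and not an actual sample count.

```python
import math


def sample_complexity_scaling(d, eps, gap, delta):
    """Evaluate d / (eps + gap)^2 * (log(1/delta) + d).

    Illustrative only: the constant hidden by the O(.) notation is
    assumed to be 1, so the value is meaningful only for comparing
    parameter settings against each other.
    """
    if d <= 0:
        raise ValueError("feature dimension d must be positive")
    if eps <= 0:
        raise ValueError("accuracy eps must be positive")
    if gap < 0:
        raise ValueError("minimum gap must be non-negative")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return d / (eps + gap) ** 2 * (math.log(1.0 / delta) + d)


if __name__ == "__main__":
    # Compare an instance with no gap to one with a large gap.
    hard = sample_complexity_scaling(d=10, eps=0.1, gap=0.0, delta=0.05)
    easy = sample_complexity_scaling(d=10, eps=0.1, gap=0.4, delta=0.05)
    print(f"no gap:    {hard:.1f}")
    print(f"gap = 0.4: {easy:.1f}")
```

Under these assumptions, a larger gap $\Delta$ shrinks the term quadratically through the $(\varepsilon+\Delta)^2$ denominator, while the confidence level enters only through the additive $\log(\frac{1}{\delta})$.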

algorithm, probability, sample complexity

Country:
- Europe > United Kingdom > England
  - Cambridgeshire > Cambridge (0.04)
  - Greater London > London (0.04)

Genre:
- Research Report (0.50)
- Workflow (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning > Learning Graphical Models
    - Undirected Networks > Markov Models (0.48)