


A The Algorithm

Neural Information Processing Systems

Construct the optimistic MDP M̃_k and compute the optimistic policy π_k (Algorithm 5). When the counter is 0, it gets (s,a), i.e., Ω_{i,e} = (s,a,). When the counter is 1, we take (s,a) from ω_n and map them to ω_{n/2} while eliminating half of the factors in consideration with the consistent scope Z_i chosen by the policy (stored in factor 2d + 1 + i of the state). It is handled similarly to the previous item, but considers the reward consistent scope z_j chosen by the policy (stored in factor 3d + 1 + j of the state). For i = 1,...,d, the i-th factor is taken from factor i of the previous state when the counter is not log n + 1; otherwise, it performs the optimistic transition of factor i. Denote the value in the last factor of Ω_{i,e} by v_e, the policy's chosen scope by Z_i (stored in factor 2d + 1 + i of the state), and the policy's chosen next-state direction by s'_i (stored in factor d + 1 + i of the state).


A Related Work

Neural Information Processing Systems

In contrast, our work is concerned with an overall limit on the total amount of information an agent may acquire from the environment and, in turn, how that translates into its selection of a feasible learning target.


Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning

Neural Information Processing Systems

In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn while only incurring bounded sub-optimality.


