
Collaborating Authors

Jin et al.


Appendix

Neural Information Processing Systems

This is only for ease of visualization. For linear MDPs, … In the generative model setting, Agarwal et al. [2020] show that the model-based approach is still minimax optimal, with sample complexity $\widetilde{O}\big((1-\gamma)^{-3}SA/\epsilon^2\big)$, by using an $s$-absorbing MDP construction; this model-based technique was later reused in other, more general settings (e.g., …). It requires a high-probability guarantee for learning the optimal policy for any reward function, which is strictly stronger than the standard learning task, in which one only needs to learn the optimal policy for a fixed reward.

B.2 General absorbing MDP

The general absorbing MDP is defined as follows: for a fixed state $s$ and a sequence $\{u_t\}_{t=1}^{H}$, the MDP $M_{s,\{u_t\}_{t=1}^{H}}$ is identical to $M$ at all states except $s$, and state $s$ is absorbing in the sense that $P_{M_{s,\{u_t\}_{t=1}^{H}}}(s \mid s, a) = 1$ for all $a$, with instantaneous reward at time $t$ given by $r_t(s, a) = u_t$ for all $a \in \mathcal{A}$. Also, we use the shorthand notation $V^{\pi}_{s,\{u_t\}}$ for $V^{\pi}_{s,\, M_{s,\{u_t\}_{t=1}^{H}}}$.

We focus on the first claim. Later we shall remove the conditioning on $N$ (see Section B.7). We use the singleton-absorbing MDP $M_{s,\{u^{\star}_t\}_{t=1}^{H}}$ to handle the case (recall $u^{\star}_t$ …
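To make the construction concrete, here is a minimal tabular sketch of building $M_{s,\{u_t\}_{t=1}^{H}}$ from $M$. It assumes a finite state/action space and explicit transition and reward arrays; the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def make_absorbing_mdp(P, r, s, u):
    """Build the general absorbing MDP M_{s,{u_t}} from a finite-horizon MDP.

    P : array of shape (H, S, A, S); P[t, x, a, y] = transition probability.
    r : array of shape (H, S, A); instantaneous rewards.
    s : index of the state made absorbing.
    u : length-H sequence {u_t}; reward received at s at step t.

    Returns (P_abs, r_abs): identical to (P, r) at every state except s,
    where s self-loops with probability 1 under every action and pays u_t.
    """
    P_abs, r_abs = P.copy(), r.copy()
    H = P.shape[0]
    for t in range(H):
        P_abs[t, s, :, :] = 0.0
        P_abs[t, s, :, s] = 1.0   # P(s | s, a) = 1 for all a
        r_abs[t, s, :] = u[t]     # r_t(s, a) = u_t for all a
    return P_abs, r_abs
```

Under this construction, the value $V^{\pi}_{s,\{u_t\}}$ is simply the value of $\pi$ at state $s$ computed in the returned MDP.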







On Reward-Free Reinforcement Learning with Linear Function Approximation

Neural Information Processing Systems

During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses the samples collected during the exploration phase to compute a near-optimal policy.
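The two-phase protocol can be summarized in a short sketch. This is a hypothetical interface, assuming `explore` and `plan` are supplied by the algorithm designer; none of these names come from the paper.

```python
# Minimal sketch of the reward-free protocol, under assumed callables:
#   explore(env)          -> one reward-free trajectory (states and actions only)
#   plan(dataset, reward) -> a near-optimal policy for the given reward function
def reward_free_rl(env, explore, plan, n_episodes):
    # Phase 1: exploration. No reward signal is observed or used.
    dataset = [explore(env) for _ in range(n_episodes)]

    # Phase 2: planning. The reward function is revealed only now; the agent
    # must answer with a near-optimal policy using only the stored dataset.
    def policy_for(reward_fn):
        return plan(dataset, reward_fn)

    return policy_for
```

The key design point is that the same exploration dataset must serve every reward function revealed afterwards, which is what makes the exploration requirement stronger than in standard RL.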





Provably Efficient Reinforcement Learning with Linear Function Approximation under Adaptivity Constraints

Neural Information Processing Systems

Real-world reinforcement learning (RL) applications often come with possibly infinite state and action spaces, and in such situations classical RL algorithms developed for the tabular setting are no longer applicable. A popular approach to overcoming this issue is to apply function approximation techniques to the underlying structure of the Markov decision process (MDP).
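As a concrete instance of such function approximation, linear architectures model $Q(s,a) \approx \phi(s,a)^{\top} w$ for a known feature map $\phi$. The sketch below fits the weight vector by regularized least squares; it is a generic illustration of the technique, not the paper's algorithm, and the names `phi`, `data`, and `reg` are assumptions.

```python
import numpy as np

def fit_linear_q(phi, data, reg=1e-3):
    """Ridge-regression fit of Q(s, a) ~ phi(s, a)^T w.

    phi  : feature map (s, a) -> d-dimensional numpy vector.
    data : iterable of (s, a, target) triples, e.g. regression targets
           from Bellman backups.
    reg  : ridge regularization strength.
    """
    d = phi(*data[0][:2]).shape[0]
    A = reg * np.eye(d)          # regularized Gram matrix
    b = np.zeros(d)
    for s, a, target in data:
        x = phi(s, a)
        A += np.outer(x, x)
        b += target * x
    return np.linalg.solve(A, b)  # weight vector w
```

With such a representation, the sample complexity can scale with the feature dimension $d$ rather than the (possibly infinite) number of states and actions.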