Near-Optimal Learning and Planning in Separated Latent MDPs

Fan Chen, Constantinos Daskalakis, Noah Golowich, Alexander Rakhlin

arXiv.org Machine Learning 

Reinforcement Learning (Kaelbling et al., 1996; Sutton and Barto, 2018) captures the common challenge of learning a good policy for an agent taking a sequence of actions in an unknown, dynamic environment, whose state transitions and reward emissions are influenced by the actions taken by the agent. Reinforcement learning has recently contributed to several headline results in Deep Learning, including Atari (Mnih et al., 2013), Go (Silver et al., 2016), and the development of Large Language Models (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022). This practical success has also sparked a burst of recent work on expanding the algorithmic, statistical, and learning-theoretic foundations of reinforcement learning, toward bridging the gap between theory and practice. In general, the agent might not fully observe the state of the environment, instead receiving only imperfect observations of it. Such a setting is captured by the general framework of Partially Observable Markov Decision Processes (POMDPs) (Smallwood and Sondik, 1973).
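For reference, the standard POMDP formalism (not spelled out in this excerpt; the notation below is the conventional one, not necessarily that of the paper) is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathbb{T}, \mathbb{O}, r, H)$, where $\mathcal{S}$ is the latent state space, $\mathcal{A}$ the action space, $\mathcal{O}$ the observation space, $\mathbb{T}(s' \mid s, a)$ the transition kernel, $\mathbb{O}(o \mid s)$ the observation (emission) distribution, $r$ the reward function, and $H$ the horizon. The agent never observes the state $s_h$ directly: at each step $h$ it receives $o_h \sim \mathbb{O}(\cdot \mid s_h)$, selects $a_h$ as a function of the observation-action history $(o_1, a_1, \ldots, o_h)$, and the latent state evolves as $s_{h+1} \sim \mathbb{T}(\cdot \mid s_h, a_h)$.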
