Near-Optimal Learning and Planning in Separated Latent MDPs
Fan Chen, Constantinos Daskalakis, Noah Golowich, Alexander Rakhlin
Reinforcement Learning (Kaelbling et al., 1996; Sutton and Barto, 2018) captures the common challenge of learning a good policy for an agent taking a sequence of actions in an unknown, dynamic environment, whose state transitions and reward emissions are influenced by the agent's actions. Reinforcement learning has recently contributed to several headline results in deep learning, including Atari (Mnih et al., 2013), Go (Silver et al., 2016), and the development of Large Language Models (Christiano et al., 2017; Stiennon et al., 2020; Ouyang et al., 2022). This practical success has also sparked a burst of recent work on expanding its algorithmic, statistical, and learning-theoretic foundations, towards bridging the gap between theory and practice. In general, the agent might not fully observe the state of the environment, instead receiving imperfect observations of it. Such a setting is captured by the general framework of Partially Observable Markov Decision Processes (POMDPs) (Smallwood and Sondik, 1973).
Jun-12-2024