Reinforcement Learning
Outcome-Driven Reinforcement Learning via Variational Inference
Standard reinforcement learning (RL) addresses reward maximization in a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p_{S_0}, p_d, r, \gamma)$ [43, 44], where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively, $p_{S_0}$ denotes the initial state distribution, $p_d$ is a state transition distribution, $r$ is an immediate reward function, and $\gamma$ is a discount factor. To sample trajectories, an initial state is sampled according to $p_{S_0}$, successive states are sampled from the state transition distribution, $S_{t+1} \sim p_d(\cdot \mid s_t, a_t)$, and actions are sampled from a policy, $A_t \sim \pi(\cdot \mid s_t)$.
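The sampling procedure described above can be sketched for a small tabular MDP. This is a minimal illustration, not code from the paper: the array layouts (`p_s0` of shape `(S,)`, `p_d` of shape `(S, A, S)`, `policy` of shape `(S, A)`), the finite `horizon`, and the function names `sample_trajectory` and `discounted_return` are all assumptions introduced here.

```python
import numpy as np


def sample_trajectory(p_s0, p_d, policy, horizon, rng):
    """Sample states and actions from a tabular MDP (illustrative layout).

    p_s0:   shape (S,)       initial state distribution
    p_d:    shape (S, A, S)  p_d[s, a, s2] = probability of s2 given (s, a)
    policy: shape (S, A)     policy[s, a] = probability of a given s
    """
    n_states, n_actions = policy.shape
    # Initial state: S_0 ~ p_{S_0}
    states = [rng.choice(n_states, p=p_s0)]
    actions = []
    for _ in range(horizon):
        s = states[-1]
        # Action: A_t ~ pi(. | s_t)
        a = rng.choice(n_actions, p=policy[s])
        actions.append(a)
        # Next state: S_{t+1} ~ p_d(. | s_t, a_t)
        states.append(rng.choice(n_states, p=p_d[s, a]))
    return states, actions


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_s0 = np.array([0.5, 0.5])
    # Hypothetical dynamics: action a moves to state a with probability 0.9.
    p_d = np.full((2, 2, 2), 0.1)
    p_d[:, 0, 0] = 0.9
    p_d[:, 1, 1] = 0.9
    policy = np.array([[0.2, 0.8], [0.7, 0.3]])
    r = np.array([[0.0, 0.0], [1.0, 1.0]])  # assumed reward table r[s, a]

    states, actions = sample_trajectory(p_s0, p_d, policy, horizon=5, rng=rng)
    rewards = [r[s, a] for s, a in zip(states[:-1], actions)]
    print(states, actions, discounted_return(rewards, gamma=0.9))
```

A trajectory of horizon $T$ contains $T+1$ states and $T$ actions; the reward table `r[s, a]` is one possible form of the immediate reward function and is used here only to show how $\gamma$ enters the return.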