Constrained Exploration in Reinforcement Learning with Optimality Preservation

Chen, Peter C. Y.

arXiv.org Artificial Intelligence 

In reinforcement learning, exploration refers to the agent taking actions according to a behavior policy in order to traverse a typically discrete state space and collect rewards. While exploring the state space, the agent uses an update rule to estimate, based on the rewards collected, the Q-values (i.e., state-action values) from one iteration to the next. If the Q-values converge to their optimums, an optimal policy can then be obtained. For a class of reinforcement learning problems, such convergence is guaranteed under the Robbins-Monro conditions [47]. A requirement for satisfying the Robbins-Monro conditions is that every state-action pair must have a non-zero probability of being visited by the agent, a property also known as persistent exploration. If we consider the agent taking an action (when it is at a state) as 'generating' a symbol denoting that action, then the sequences of actions generated by the agent as it traverses the states represent the behavior of the agent. For an episodic learning process, the behavior of the agent consists of all possible action sequences from the initial state to the set of goal states. We refer to such a process as an unconstrained learning process, and to the associated optimal Q-values as the intrinsic optimums.
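The unconstrained learning process described above can be illustrated with a minimal sketch of tabular Q-learning. Everything in it is an assumption made for illustration and is not taken from the paper: the five-state chain task, the names `step`, `behavior_action`, and `q_learning`, the discount factor, and the exploration rate. The behavior policy is epsilon-greedy, so every state-action pair keeps a non-zero probability of being visited (persistent exploration), and the step size 1/N(s, a), with N(s, a) the visit count, is one common choice whose sum diverges while the sum of its squares stays finite, as the Robbins-Monro conditions require.

```python
import random

# Hypothetical toy task: a chain of states 0..4, with state 4 as the only goal state.
N_STATES = 5
GOAL = N_STATES - 1
ACTIONS = (0, 1)   # 0 = move left, 1 = move right
GAMMA = 0.9        # discount factor (assumed)
EPSILON = 0.2      # exploration rate (assumed)


def step(state, action):
    """Deterministic chain dynamics: reward 1 on reaching the goal, else 0."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL


def behavior_action(q, state, rng):
    """Epsilon-greedy behavior policy; every action keeps probability >= EPSILON / 2."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state])
    # Break ties at random so the untrained agent is not biased toward one action.
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=2000, seed=0):
    """Run episodic Q-learning from state 0 and return (Q-table, visit counts)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    visits = [[0, 0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = behavior_action(q, state, rng)
            nxt, reward, done = step(state, action)
            visits[state][action] += 1
            # Step size 1/N(s, a): sum diverges, sum of squares converges.
            alpha = 1.0 / visits[state][action]
            # No bootstrapping from the goal state, which ends the episode.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q, visits
```

Calling `q, visits = q_learning()` returns the estimated Q-table together with the visit counts; a greedy policy can then be read off by taking, in each non-goal state, the action with the largest Q-value. In this sketch no action sequence is ever ruled out, which is what makes the process unconstrained in the sense used above.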