On the Convergence of Discounted Policy Gradient Methods
Nota, Chris
Policy gradient methods are a class of reinforcement learning (RL) algorithms that attempt to directly maximize the expected performance of an agent's policy by following the gradient of an objective function (Sutton et al., 2000), typically the expected sum of rewards, using a stochastic estimator generated by interacting with the environment. Unbiased estimators of this gradient can suffer from high variance because the sum of future rewards is itself highly variable. A common approach is to instead consider an exponentially discounted sum of future rewards. This reduces the variance of most estimators but introduces bias (Thomas, 2014). Frequently, the discounted sum of future rewards is estimated by a critic (Konda and Tsitsiklis, 2000). It has been argued that when a critic is used, discounting has the additional benefit of reducing approximation error (Zhang et al., 2020). The "discounted" policy gradient was originally introduced as the gradient of a discounted objective (Sutton et al., 2000). However, it has been shown that the gradient of the discounted objective is not the update direction followed by most discounted policy gradient algorithms (Thomas, 2014; Nota and Thomas, 2019).
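For reference, the quantities at issue can be written as follows. This is a standard formulation in assumed notation (R_t, pi_theta, Q_gamma), not an excerpt from the paper; the final expression is the direction most "discounted" policy gradient implementations follow, which drops the gamma^t weighting and is therefore not the gradient of the discounted objective.

```latex
% Assumed notation: R_t is the reward at time t, \pi_\theta the parameterized
% policy, and Q^{\pi_\theta}_\gamma the discounted action-value function.
\begin{align*}
J(\theta) &= \mathbb{E}\!\left[\sum_{t=0}^{\infty} R_t\right]
  & &\text{(undiscounted objective)} \\
J_\gamma(\theta) &= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]
  & &\text{(discounted objective)} \\
\nabla J_\gamma(\theta) &= \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t\,
  \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, Q^{\pi_\theta}_\gamma(S_t, A_t)\right]
  & &\text{(gradient of } J_\gamma\text{)} \\
\Delta(\theta) &= \mathbb{E}\!\left[\sum_{t=0}^{\infty}
  \nabla_\theta \log \pi_\theta(A_t \mid S_t)\, Q^{\pi_\theta}_\gamma(S_t, A_t)\right]
  & &\text{(direction most algorithms follow)}
\end{align*}
```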
Is the Policy Gradient a Gradient?
Nota, Chris, Thomas, Philip S.
The policy gradient theorem describes the gradient of the expected discounted return with respect to an agent's policy parameters. However, most policy gradient methods do not use the discount factor in the manner originally prescribed, and therefore do not optimize the discounted objective. It has been an open question in RL which objective, if any, they optimize instead. We show that the direction followed by these methods is not the gradient of any objective, and reclassify them as semi-gradient methods with respect to the undiscounted objective. Further, we show that they are not guaranteed to converge to a locally optimal policy, and construct a counterexample where they will converge to the globally pessimal policy with respect to both the discounted and undiscounted objectives.
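A minimal REINFORCE-style sketch of the distinction, using assumed names and notation (nothing here is taken from the paper's code): the first estimate weights each term by gamma^t, matching the gradient of the discounted objective, while the second drops that factor, matching the update most implementations actually apply.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * R_k for each time step t."""
    G, out = 0.0, np.zeros(len(rewards))
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

def gradient_estimates(logprob_grads, rewards, gamma):
    """logprob_grads[t] is assumed to be grad_theta log pi_theta(A_t | S_t)."""
    G = discounted_returns(rewards, gamma)
    # Gradient of the discounted objective: each term carries a gamma**t weight.
    true_grad = sum((gamma ** t) * G[t] * g for t, g in enumerate(logprob_grads))
    # Direction most implementations follow: the gamma**t factor is dropped,
    # which is not the gradient of the discounted or the undiscounted objective.
    semi_grad = sum(G[t] * g for t, g in enumerate(logprob_grads))
    return true_grad, semi_grad
```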
Lifelong Learning with a Changing Action Set
Chandak, Yash, Theocharous, Georgios, Nota, Chris, Thomas, Philip S.
In many real-world sequential decision-making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing reward functions, etc., have been well-studied in the lifelong learning literature, the setting where the action set changes remains unaddressed. In this paper, we present an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efficiency of this approach on large-scale real-world lifelong learning problems.
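As a toy illustration only (the 1-D environment, function names, and the nearest-effect policy below are invented for exposition, not the paper's method), the two sub-problems can be alternated like this: whenever the action set changes, re-estimate each available action's unknown effect, then act with a policy defined over those estimates rather than over the raw action identities.

```python
import random

GOAL = 10.0

def true_effect(action):
    # Hidden dynamics: the agent only observes noisy outcomes of taking an action.
    return action + random.gauss(0.0, 0.5)

def infer_action_effects(actions, samples=20):
    # Sub-problem 1: infer structure over the currently available actions.
    return {a: sum(true_effect(a) for _ in range(samples)) / samples for a in actions}

def choose_action(position, effects):
    # Sub-problem 2: a policy defined over the inferred structure, so it applies
    # unchanged when new actions appear or old ones disappear.
    return min(effects, key=lambda a: abs(GOAL - (position + effects[a])))

for available in [{-1, 1}, {-1, 1, 3}, {-2, 1, 3, 5}]:   # the action set changes
    effects = infer_action_effects(available)
    position = 0.0
    for _ in range(15):
        position += true_effect(choose_action(position, effects))
    print(sorted(available), round(position, 1))
```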
Reinforcement Learning Without Backpropagation or a Clock
Kostas, James, Nota, Chris, Thomas, Philip S.
Reinforcement learning (RL) algorithms share qualitative similarities with the algorithms implemented by animal brains. However, there remain clear differences between these two types of algorithms. For example, while RL algorithms using artificial neural networks require information to flow backwards through the network via the backpropagation algorithm, there is currently debate about whether this is feasible in biological neural implementations (Werbos and Davis, 2016). Policy gradient coagent networks (PGCNs) are a class of RL algorithms that were introduced to remove this possibly biologically implausible property of RL algorithms: they use artificial neural networks but do not rely on the backpropagation algorithm (Thomas, 2011). Since their introduction, PGCN algorithms have proven to be not only a possible improvement in biological plausibility, but also a practical tool for improving RL agents.
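A minimal, self-contained sketch of the coagent idea under stated assumptions (the toy task, two-unit network, and step size are invented; this is not the published PGCN algorithm): each stochastic unit performs its own local likelihood-ratio update using the shared reward, so no gradient information is backpropagated between units.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two coagents: a hidden unit that reads the observation, and an output unit
# that reads the hidden unit's stochastic activation. Toy task: the output
# should match the observation bit.
w_hidden = rng.normal(size=2)   # weights on [obs, bias]
w_out = rng.normal(size=2)      # weights on [hidden activation, bias]
alpha = 0.1

for step in range(5000):
    obs = float(rng.integers(2))
    # Each coagent samples its action from its own Bernoulli policy.
    p_h = sigmoid(w_hidden @ np.array([obs, 1.0]))
    h = float(rng.random() < p_h)
    p_a = sigmoid(w_out @ np.array([h, 1.0]))
    a = float(rng.random() < p_a)
    reward = 1.0 if a == obs else 0.0
    # Local likelihood-ratio updates: each coagent uses only its own inputs,
    # its own sampled action, and the shared reward -- no backpropagation.
    w_hidden += alpha * reward * (h - p_h) * np.array([obs, 1.0])
    w_out += alpha * reward * (a - p_a) * np.array([h, 1.0])
```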