
Review for NeurIPS paper: Escaping the Gravitational Pull of Softmax

Neural Information Processing Systems

Summary and Contributions: ##Update## The rebuttal adequately addressed my main concerns, and I am consequently increasing my score to a 7. In particular, I was pleased that the authors investigated the issues with the learning rate; I would be happy if they mentioned this potential limitation in their revision and included the experimental results showing that the naive adaptive-learning-rate proposals I made would not be effective. It was also pleasing that they will discuss and compare with Neural Replicator Dynamics, and the additional experiment with sampled actions looks promising. The reason I did not increase my score further is that the current set of experiments is still rather simple, and it is difficult for me to assess whether the new method is likely to be widely used. That said, I feel the contribution may well turn out to be much more influential.


Reviews: Cold-Start Reinforcement Learning with Softmax Policy Gradient

Neural Information Processing Systems

The paper presents a new method for structured output prediction using reinforcement learning. Previous methods used reward-augmented maximum likelihood or policy gradients; the new method uses a softmax objective. The authors present a new inference method that can efficiently evaluate the integral in the objective. In addition, the authors propose additional reward functions that encode prior knowledge (e.g., to avoid word repetitions).



Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration

Morrill, Dustin, Saleh, Esra'a, Bowling, Michael, Greenwald, Amy

arXiv.org Machine Learning

Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm, motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high probability, and the benefits this entails for NeuRD-CIX in sequential decision-making settings. Our analysis reveals a bias--variance tradeoff between SPG and NeuRD, and shows how theory predicts that NeuRD-CIX will perform well more consistently than NeuRD while retaining NeuRD's advantages over SPG in non-stationary environments.
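The distinction the abstract draws can be illustrated with a minimal numpy sketch of the two Monte Carlo update estimators for a single softmax policy over a few actions. This is an assumption-laden illustration, not the paper's implementation: the SPG update below is the standard REINFORCE form, in which the 1/pi[a] importance weight cancels against the gradient of pi(a), while the NeuRD-style update keeps the importance-corrected value estimate q_hat(a') = r * 1[a'=a] / pi[a] and therefore inherits its variance. The paper's exact CIX estimator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = rng.normal(size=4)       # logits of a small softmax policy
pi = softmax(z)              # action probabilities
a = rng.choice(4, p=pi)      # one sampled action
r = 1.0                      # observed reward for that action
one_hot = np.eye(4)[a]

# SPG (REINFORCE) Monte Carlo update on the logits:
# the 1/pi[a] importance weight cancels against grad pi(a),
# so no importance factor survives in the estimate.
spg_update = r * (one_hot - pi)

# NeuRD-style Monte Carlo update: the sampled value estimate
# q_hat(a') = r * 1[a'=a] / pi[a] retains the 1/pi[a] factor,
# which blows up (high variance) when pi[a] is small.
q_hat = r * one_hot / pi[a]
neurd_update = q_hat - q_hat @ pi   # center by the value estimate
```

Capping the retained importance factor (e.g., replacing 1/pi[a] with min(1/pi[a], c) for some cap c) is one natural way to interpolate between the two estimators; this is offered only as intuition for the bias--variance tradeoff the abstract describes, not as the paper's CIX construction.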



Cold-Start Reinforcement Learning with Softmax Policy Gradient

Ding, Nan, Soricut, Radu

Neural Information Processing Systems

Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.