
Review for NeurIPS paper: Escaping the Gravitational Pull of Softmax

Neural Information Processing Systems

Summary and Contributions: ##Update## The rebuttal adequately addressed my main concerns, and I am consequently increasing my score to a 7. In particular, I was pleased that the authors investigated the issues with the learning rate; I would be happy if they mentioned this potential limitation in their revision and included the experimental results showing that the naive adaptive-learning-rate proposals I made would not be effective. It was also pleasing that they will discuss and compare with Neural Replicator Dynamics, and the additional experiment with sampled actions looks promising. The reason I did not increase my score further is that the current set of experiments is still rather simple, and it is difficult for me to assess whether the new method is likely to be widely used. That said, I feel the contribution may well turn out to be much more influential.


Reviews: Cold-Start Reinforcement Learning with Softmax Policy Gradient

Neural Information Processing Systems

The paper presents a new method for structured output prediction using reinforcement learning. Previous methods used reward-augmented maximum likelihood or policy gradients; the new method uses a softmax objective. The authors present a new inference method that can efficiently evaluate the integral in the objective. In addition, the authors propose additional reward functions that encode prior knowledge (e.g., to avoid word repetitions).



Interpolating Between Softmax Policy Gradient and Neural Replicator Dynamics with Capped Implicit Exploration

Morrill, Dustin, Saleh, Esra'a, Bowling, Michael, Greenwald, Amy

arXiv.org Machine Learning

Neural replicator dynamics (NeuRD) is an alternative to the foundational softmax policy gradient (SPG) algorithm, motivated by online learning and evolutionary game theory. The NeuRD expected update is designed to be nearly identical to that of SPG; however, we show that the Monte Carlo updates differ in a substantial way: the importance correction accounting for a sampled action is nullified in the SPG update, but not in the NeuRD update. Naturally, this causes the NeuRD update to have higher variance than its SPG counterpart. Building on implicit exploration algorithms in the adversarial bandit setting, we introduce capped implicit exploration (CIX) estimates that allow us to construct NeuRD-CIX, which interpolates between this aspect of NeuRD and SPG. We show how CIX estimates can be used in a black-box reduction to construct bandit algorithms with regret bounds that hold with high probability, and the benefits this entails for NeuRD-CIX in sequential decision-making settings. Our analysis reveals a bias--variance tradeoff between SPG and NeuRD, and shows how theory predicts that NeuRD-CIX will perform well more consistently than NeuRD while retaining NeuRD's advantages over SPG in non-stationary environments.
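The distinction the abstract draws can be illustrated with a minimal numpy sketch of the two Monte Carlo update estimators for a single softmax policy over a few actions. This is an assumption-laden illustration, not the paper's implementation: the SPG update below is the standard REINFORCE form, in which the 1/pi[a] importance weight cancels against the gradient of pi(a), while the NeuRD-style update keeps the importance-corrected value estimate q_hat(a') = r * 1[a'=a] / pi[a] and therefore inherits its variance. The paper's exact CIX estimator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = rng.normal(size=4)       # logits of a small softmax policy
pi = softmax(z)              # action probabilities
a = rng.choice(4, p=pi)      # one sampled action
r = 1.0                      # observed reward for that action
one_hot = np.eye(4)[a]

# SPG (REINFORCE) Monte Carlo update on the logits:
# the 1/pi[a] importance weight cancels against grad pi(a),
# so no importance factor survives in the estimate.
spg_update = r * (one_hot - pi)

# NeuRD-style Monte Carlo update: the sampled value estimate
# q_hat(a') = r * 1[a'=a] / pi[a] retains the 1/pi[a] factor,
# which blows up (high variance) when pi[a] is small.
q_hat = r * one_hot / pi[a]
neurd_update = q_hat - q_hat @ pi   # center by the value estimate
```

Capping the retained importance factor (e.g., replacing 1/pi[a] with min(1/pi[a], c) for some cap c) is one natural way to interpolate between the two estimators; this is offered only as intuition for the bias--variance tradeoff the abstract describes, not as the paper's CIX construction.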



Cold-Start Reinforcement Learning with Softmax Policy Gradient

Ding, Nan, Soricut, Radu

Neural Information Processing Systems

Policy-gradient approaches to reinforcement learning have two common and undesirable overhead procedures, namely warm-start training and sample variance reduction. In this paper, we describe a reinforcement learning method based on a softmax value function that requires neither of these procedures. Our method combines the advantages of policy-gradient methods with the efficiency and simplicity of maximum-likelihood approaches. We apply this new cold-start reinforcement learning method in training sequence generation models for structured output prediction problems. Empirical evidence validates this method on automatic summarization and image captioning tasks.