
 kl-control


67496dfa96afddab795530cc7c69b57a-Supplemental-Conference.pdf

Neural Information Processing Systems

The optimal baseline, however, is rarely used in practice (Sutton & Barto, 2018; for an exception, see Peters & Schaal, 2008). Equation (1) then takes the following form:

$\nabla_\theta \, \mathbb{E}_{x \sim \pi_\theta}[R(x)] = \mathbb{E}_{x \sim \pi_\theta}\left[(R(x) - B)\, \nabla_\theta \log \pi_\theta(x)\right].$
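As a concrete illustration of this update, below is a minimal sketch of REINFORCE with a baseline on a toy four-armed bandit. The action space, reward values, learning rate, and the choice of the expected reward as the baseline B are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Minimal REINFORCE-with-baseline sketch: theta parameterizes a softmax
# policy over 4 actions; each update applies (R(x) - B) * grad log pi_theta(x).
rng = np.random.default_rng(0)
theta = np.zeros(4)                       # policy logits
rewards = np.array([0.0, 0.2, 0.5, 1.0])  # hypothetical reward R(x) per action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for step in range(1000):
    probs = softmax(theta)
    x = rng.choice(4, p=probs)   # sample x ~ pi_theta
    B = rewards @ probs          # baseline: expected reward under pi_theta
    grad_log_pi = -probs         # grad of log softmax: one_hot(x) - probs
    grad_log_pi[x] += 1.0
    theta += 0.1 * (rewards[x] - B) * grad_log_pi
```

With B set to the expected reward, sampled actions that do no better than average have their probability pushed down, which reduces gradient variance without biasing the estimator.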



On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Korbak, Tomasz, Elsahar, Hady, Kruszewski, Germán, Dymetman, Marc

arXiv.org Artificial Intelligence

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM instead prescribes first making explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks, such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability, and sample efficiency.
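To make the bridge between the two paradigms concrete, here is a hedged sketch of a KL-control-style policy-gradient step with a baseline: the reward is shaped by a KL penalty towards the frozen pre-trained model a, and a baseline is subtracted before weighting the score function. The function name, the scalar per-sequence log-probabilities, and the beta/baseline arguments are assumptions for illustration, not the paper's API.

```python
import torch

def kl_shaped_pg_loss(policy_logprob, prior_logprob, reward,
                      beta=0.1, baseline=0.0):
    """Policy-gradient loss for one sampled sequence x.

    policy_logprob: log pi_theta(x), scalar tensor with requires_grad=True
    prior_logprob:  log a(x) under the frozen pre-trained model (scalar)
    reward:         task reward R(x) (scalar)
    """
    # KL-control reward shaping: r'(x) = R(x) - beta * log(pi(x) / a(x))
    shaped = reward - beta * (policy_logprob.detach() - prior_logprob)
    # Subtracting a baseline leaves the gradient unbiased but lowers variance.
    return -(shaped - baseline) * policy_logprob
```

Maximizing this shaped reward is, up to a constant, equivalent to minimizing a KL divergence to a target distribution proportional to a(x) exp(R(x)/beta), which is the sense in which KL-control can be read as a DM method.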


Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog

Jaques, Natasha, Ghandeharioun, Asma, Shen, Judy Hanwen, Ferguson, Craig, Lapedriza, Agata, Jones, Noah, Gu, Shixiang, Picard, Rosalind

arXiv.org Artificial Intelligence

Most deep reinforcement learning (RL) systems are not able to learn effectively from off-policy data, especially if they cannot explore online in the environment. These are critical shortcomings for applying RL to real-world problems where collecting data is expensive, and models must be tested offline before being deployed to interact with the environment -- e.g., systems that learn from human interaction. Thus, we develop a novel class of off-policy batch RL algorithms, which are able to effectively learn offline, without exploring, from a fixed batch of human interaction data. We leverage models pre-trained on data as a strong prior, and use KL-control to penalize divergence from this prior during RL training. We also use dropout-based uncertainty estimates to lower bound the target Q-values as a more efficient alternative to Double Q-Learning. The algorithms are tested on the problem of open-domain dialog generation -- a challenging reinforcement learning problem with a 20,000-dimensional action space. Using our Way Off-Policy algorithm, we can extract multiple different reward functions post-hoc from collected human interaction data, and learn effectively from all of these. We test the real-world generalization of these systems by deploying them live to converse with humans in an open-domain setting, and demonstrate that our algorithm achieves significant improvements over prior methods in off-policy batch RL.
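As a hedged sketch of the two mechanisms named above, the snippet below (i) lower-bounds target Q-values by keeping dropout active in the target network and taking a minimum over several stochastic forward passes, and (ii) penalizes the per-step reward with the log-ratio to the frozen pre-trained prior. The network interface, the number of dropout samples, and the penalty weight c are assumptions, not the paper's exact hyperparameters.

```python
import torch

@torch.no_grad()
def lower_bound_target_q(target_net: torch.nn.Module,
                         state: torch.Tensor,
                         num_samples: int = 10) -> torch.Tensor:
    """Pessimistic target Q-values via MC dropout (alternative to Double Q-Learning)."""
    target_net.train()  # keep dropout active at evaluation time
    qs = torch.stack([target_net(state) for _ in range(num_samples)])
    target_net.eval()
    return qs.min(dim=0).values  # lower bound across dropout masks

def kl_penalized_reward(reward, policy_logprob, prior_logprob, c=0.1):
    # KL-control: r'(s, a) = r(s, a) - c * (log pi(a|s) - log p_prior(a|s))
    return reward - c * (policy_logprob - prior_logprob)
```

The min over dropout masks plays the role of the uncertainty-based lower bound described in the abstract: the more the stochastic passes disagree, the smaller the bootstrapped target, discouraging the Q-function from exploiting actions it has little data for.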