Offline RL for Natural Language Generation with Implicit Language Q Learning

Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine

arXiv.org Artificial Intelligence 

Left: an abstract depiction of an MDP where single-step RL fails to discover the optimal policy. Right: a notional illustrative example where we might expect full "multi-step" RL methods (such as ILQL) to perform significantly better than "single-step" methods. In this example, good utterances tend to start with "The movie was...", while bad utterances start with "The movie wasn't..." However, the very best examples also start with "The movie wasn't...", requiring multi-step planning or multiple steps of policy improvement to derive effective strategies. Methods that implement just a single step of policy improvement will therefore fail to produce maximally positive sentiment outputs. While this example may appear somewhat contrived, our experiments show that multi-step RL methods do lead to improvements in a number of more realistic settings.

Since ILQL performs multiple steps of policy improvement, it can significantly improve over Monte Carlo estimators or single-step RL when the underlying data is sub-optimal. One example corresponds to the notional task in Figure 4, in which the optimal sequence of actions requires traversing a state that is also frequented by sub-optimal examples. In this case, single-step RL will learn to take actions that appear safer according to the dataset -- such as the transition "The movie" → "was" in Figure 4 -- whereas full ("multi-step") RL methods would recover the optimal policy. We demonstrate this empirically on the Wordle game below.
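To make the distinction concrete, below is a minimal tabular Python sketch of the notional "The movie was / wasn't" MDP. It is not the paper's ILQL implementation (ILQL trains token-level Q-values with an expectile-style objective on top of a language model); the token strings, rewards, and dataset counts are invented for illustration. A single step of policy improvement over the behavior policy's Monte Carlo Q-values prefers "was", while a full Bellman (max) backup propagates the value of the rare best continuation back through "wasn't".

```python
# Tabular toy version of the "The movie was / wasn't" example above.
# All token strings, rewards, and counts are illustrative assumptions.
from collections import defaultdict

# Offline dataset: each trajectory is a list of (state, action) steps plus
# the sentiment reward of the finished utterance.
trajectories = (
    # Good but not maximal: "The movie was great."
    [([("The movie", "was"), ("The movie was", "great.")], 0.6)] * 45
    # Bad: "The movie wasn't good."
    + [([("The movie", "wasn't"), ("The movie wasn't", "good.")], -1.0)] * 50
    # Rare, maximally positive: "The movie wasn't just good, ..."
    + [([("The movie", "wasn't"),
         ("The movie wasn't", "just good, it was a masterpiece.")], 1.0)] * 5
)

# ---- Single-step RL: Monte Carlo Q-values of the behavior policy ----
# Q_beta(s, a) = average return of the dataset trajectories taking a in s.
mc = defaultdict(list)
for steps, reward in trajectories:
    for s, a in steps:
        mc[(s, a)].append(reward)
q_beta = {sa: sum(rs) / len(rs) for sa, rs in mc.items()}

# ---- Multi-step RL: tabular Q-iteration with a max (Bellman) backup ----
# Q_star(s, a) = r + max_a' Q_star(s', a'), swept until convergence.
transitions = {}  # (s, a) -> (reward, next_state or None)
for steps, reward in trajectories:
    for i, (s, a) in enumerate(steps):
        nxt = steps[i + 1][0] if i + 1 < len(steps) else None
        transitions[(s, a)] = (reward if nxt is None else 0.0, nxt)

q_star = {sa: 0.0 for sa in transitions}
for _ in range(5):  # plenty of sweeps for this horizon-2 MDP
    for (s, a), (r, nxt) in transitions.items():
        bootstrap = 0.0 if nxt is None else max(
            v for (s2, _), v in q_star.items() if s2 == nxt)
        q_star[(s, a)] = r + bootstrap

def greedy(q, state):
    """Pick the in-dataset action with the highest Q-value at `state`."""
    return max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])

print("single-step RL picks:", greedy(q_beta, "The movie"))  # -> "was"
print("multi-step RL picks: ", greedy(q_star, "The movie"))  # -> "wasn't"
```

Here the behavior Q-value of "wasn't" is dragged down by the many bad continuations that follow it in the data, so one step of improvement stays with "was"; the max backup instead credits "wasn't" with the value of the best observed continuation, which is the effect the multi-step argument above relies on.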
