Offline RL for Natural Language Generation with Implicit Language Q Learning

Charlie Snell, Ilya Kostrikov, Yi Su, Mengjiao Yang, Sergey Levine

arXiv.org Artificial Intelligence 

Left: an abstract depiction of an MDP where single-step RL fails to discover the optimal policy. Right: a notional illustrative example where we might expect full "multi-step" RL methods (such as ILQL) to perform significantly better than "single-step" methods. In this example, good utterances tend to start with "The movie was...", while bad utterances start with "The movie wasn't..." However, the very best examples also start with "The movie wasn't...", requiring multi-step planning or multiple steps of policy improvement to derive effective strategies. Methods that implement just a single step of policy improvement will therefore fail to produce maximally positive sentiment outputs. While this example may appear somewhat contrived, our experiments show that multi-step RL methods do lead to improvements in a number of more realistic settings.

Since ILQL performs multiple steps of policy improvement, it can significantly improve over Monte Carlo estimators or single-step RL when the underlying data is sub-optimal. One example corresponds to the notional task in Figure 4, in which the optimal sequence of actions requires traversing a state that is also frequented by sub-optimal examples. In this case, single-step RL will learn to take actions that appear safer according to the dataset -- such as the transition "The movie" → "was" in Figure 4 -- whereas full ("multi-step") RL methods would recover the optimal policy. We demonstrate this empirically on the Wordle game below.
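To make the distinction concrete, below is a minimal tabular Python sketch of the notional "The movie was / wasn't" MDP. It is not the paper's ILQL implementation (ILQL trains token-level Q-values with an expectile-style objective on top of a language model); the token strings, rewards, and dataset counts are invented for illustration. A single step of policy improvement over the behavior policy's Monte Carlo Q-values prefers "was", while a full Bellman (max) backup propagates the value of the rare best continuation back through "wasn't".

```python
# Tabular toy version of the "The movie was / wasn't" example above.
# All token strings, rewards, and counts are illustrative assumptions.
from collections import defaultdict

# Offline dataset: each trajectory is a list of (state, action) steps plus
# the sentiment reward of the finished utterance.
trajectories = (
    # Good but not maximal: "The movie was great."
    [([("The movie", "was"), ("The movie was", "great.")], 0.6)] * 45
    # Bad: "The movie wasn't good."
    + [([("The movie", "wasn't"), ("The movie wasn't", "good.")], -1.0)] * 50
    # Rare, maximally positive: "The movie wasn't just good, ..."
    + [([("The movie", "wasn't"),
         ("The movie wasn't", "just good, it was a masterpiece.")], 1.0)] * 5
)

# ---- Single-step RL: Monte Carlo Q-values of the behavior policy ----
# Q_beta(s, a) = average return of the dataset trajectories taking a in s.
mc = defaultdict(list)
for steps, reward in trajectories:
    for s, a in steps:
        mc[(s, a)].append(reward)
q_beta = {sa: sum(rs) / len(rs) for sa, rs in mc.items()}

# ---- Multi-step RL: tabular Q-iteration with a max (Bellman) backup ----
# Q_star(s, a) = r + max_a' Q_star(s', a'), swept until convergence.
transitions = {}  # (s, a) -> (reward, next_state or None)
for steps, reward in trajectories:
    for i, (s, a) in enumerate(steps):
        nxt = steps[i + 1][0] if i + 1 < len(steps) else None
        transitions[(s, a)] = (reward if nxt is None else 0.0, nxt)

q_star = {sa: 0.0 for sa in transitions}
for _ in range(5):  # plenty of sweeps for this horizon-2 MDP
    for (s, a), (r, nxt) in transitions.items():
        bootstrap = 0.0 if nxt is None else max(
            v for (s2, _), v in q_star.items() if s2 == nxt)
        q_star[(s, a)] = r + bootstrap

def greedy(q, state):
    """Pick the in-dataset action with the highest Q-value at `state`."""
    return max((a for (s, a) in q if s == state), key=lambda a: q[(state, a)])

print("single-step RL picks:", greedy(q_beta, "The movie"))  # -> "was"
print("multi-step RL picks: ", greedy(q_star, "The movie"))  # -> "wasn't"
```

Here the behavior Q-value of "wasn't" is dragged down by the many bad continuations that follow it in the data, so one step of improvement stays with "was"; the max backup instead credits "wasn't" with the value of the best observed continuation, which is the effect the multi-step argument above relies on.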
