Learning Natural Language Generation from Scratch

Alice Martin Donati, Guillaume Quispe, Charles Ollion, Sylvain Le Corff, Florian Strub, Olivier Pietquin

arXiv.org Machine Learning 

Since the development of generic language models trained on massive unlabelled text corpora (Radford et al., 2019; Brown et al., 2020), state-of-the-art language processing systems rely on sequential transfer learning (Ruder, 2019). The pretrained Language Model (LM) is fine-tuned on the downstream task using a standard supervised learning (SL) objective (Wu et al., 2019; Peters et al., 2019). Yet, such an approach suffers from several issues (Chen et al., 2020): (i) catastrophic forgetting, where the model forgets previously learned knowledge and overfits to the target domain, (ii) the computational inefficiency of fine-tuning billion-parameter networks, and (iii) the need for supervised datasets. Moreover, task-specific language models learned with SL suffer from well-studied text degeneration issues (Holtzman et al., 2019), such as exposure bias (Bengio et al., 2015), language biases (Saleh et al., 2020; Jaques et al., 2020), or a lack of diversity (Li et al., 2015). On the other hand, text generation can be naturally framed as a sequential decision-making problem, with the sequence of words seen as successive actions over a vocabulary. Thus, some researchers have recently focused on learning language models with Reinforcement Learning (RL) instead (Strub et al., 2017; Das et al., 2017; Narasimhan et al., 2015).
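To make the decision-making framing concrete, the sketch below casts generation as an episodic process: the state is the partial sequence produced so far, each action is a word drawn from the vocabulary, and the episode ends when an end-of-sequence token is emitted. This is a minimal illustration, not the paper's method; the toy vocabulary, the uniform placeholder policy, and the episode length are assumptions made for the example (a real system would use a language model as the policy and a task-specific reward).

```python
import random

# Toy illustration of text generation as sequential decision making.
# State  = the partial sequence generated so far.
# Action = the next word, chosen from the vocabulary.
# The policy is uniform here purely for illustration; in an RL setting
# it would be a learned language model scoring the vocabulary.

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]  # hypothetical vocabulary
MAX_LEN = 8  # assumed cap on episode length

def policy(state):
    """Return a probability distribution over actions (words) given the state."""
    return [1.0 / len(VOCAB)] * len(VOCAB)  # placeholder: uniform distribution

def generate_episode():
    state = []       # start from the empty prefix
    trajectory = []  # (state, action) pairs, as used by policy-gradient methods
    for _ in range(MAX_LEN):
        probs = policy(state)
        action = random.choices(range(len(VOCAB)), weights=probs)[0]
        trajectory.append((tuple(state), action))
        state.append(VOCAB[action])
        if VOCAB[action] == "<eos>":  # episode terminates with end-of-sequence
            break
    return state, trajectory

if __name__ == "__main__":
    sentence, traj = generate_episode()
    print(" ".join(sentence))
```

Framing generation this way is what lets RL objectives replace token-level supervision: a reward assigned to the completed sequence can drive learning over the recorded trajectory, rather than requiring a ground-truth next word at every step.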