Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models

Cheolhyoung Lee, Kyunghyun Cho, Wanmo Kang

arXiv.org Machine Learning 

ABSTRACT

In natural language processing, it has been observed recently that generalization could be greatly improved by finetuning a large-scale language model pretrained on a large unlabeled corpus. Despite its recent success and wide adoption, finetuning a large pretrained language model on a downstream task is prone to degenerate performance when there are only a small number of training instances available. In this paper, we introduce a new regularization technique, to which we refer as "mixout", motivated by dropout. Mixout stochastically mixes the parameters of two models. We show that our mixout technique regularizes learning to minimize the deviation from one of the two models and that the strength of regularization adapts along the optimization trajectory. We empirically evaluate the proposed mixout and its variants on finetuning a pretrained language model on downstream tasks. More specifically, we demonstrate that the stability of finetuning and the average accuracy greatly increase when we use the proposed approach to regularize finetuning of BERT on downstream tasks in GLUE.

1 INTRODUCTION

Transfer learning has been widely used for the tasks in natural language processing (NLP) (Collobert et al., 2011; Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019; Phang et al., 2018). In particular, Devlin et al. (2018) recently demonstrated the effectiveness of finetuning a large-scale language model pretrained on a large, unannotated corpus on a wide range of NLP tasks including question answering and language inference. They have designed two variants of models, BERT LARGE (340M parameters) and BERT BASE (110M parameters). Although BERT LARGE outperforms BERT BASE generally, it was observed that finetuning sometimes fails when a target dataset has fewer than 10,000 training instances (Devlin et al., 2018; Phang et al., 2018).

When finetuning a big, pretrained language model, dropout (Srivastava et al., 2014) has been used as a regularization technique to prevent co-adaptation of neurons (Vaswani et al., 2017; Devlin et al., 2018; Yang et al., 2019).
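
To make the phrase "stochastically mixes the parameters of two models" concrete, the following is a minimal sketch, not the authors' implementation. This excerpt does not give the exact formulation, so the sketch assumes that each parameter is independently replaced by its counterpart from the second model with probability p, followed by a dropout-style rescaling so that the expected result equals the current parameter. The function name `mixout` and its arguments are hypothetical.

```python
import random


def mixout(current, pretrained, p, rng=None):
    """Hypothetical helper: element-wise stochastic mix of two parameter lists.

    Assumption (not stated in the excerpt): each entry of `current` is swapped
    for the matching entry of `pretrained` with probability p, then shifted and
    rescaled so that the expectation of the output equals `current`.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if len(current) != len(pretrained):
        raise ValueError("parameter lists must have the same length")
    if rng is None:
        rng = random.Random()

    mixed = []
    for w, u in zip(current, pretrained):
        # With probability p take the second model's value, otherwise keep w.
        chosen = u if rng.random() < p else w
        # E[chosen] = p*u + (1-p)*w, so subtracting p*u and dividing by (1-p)
        # gives an output whose expectation is exactly w.
        mixed.append((chosen - p * u) / (1.0 - p))
    return mixed


if __name__ == "__main__":
    finetuned = [1.0, -2.0, 0.5]
    pretrained = [0.8, -1.5, 0.0]
    print(mixout(finetuned, pretrained, p=0.5, rng=random.Random(0)))
```

Under these assumptions, passing an all-zero list as `pretrained` reduces the sketch to ordinary inverted dropout on the parameters (each entry becomes either 0 or w / (1 - p)), which is one way to read the abstract's remark that mixout is motivated by dropout.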
