
Reviews: Unified Language Model Pre-training for Natural Language Understanding and Generation

Neural Information Processing Systems

This paper provides a method to pretrain a single Transformer architecture on three objectives: (i) unidirectional (left-to-right) language modelling, (ii) bidirectional masked language modelling, and (iii) sequence-to-sequence language modelling. This unified architecture circumvents the shortcomings of models like BERT (which can condition on bidirectional context, but is harder to use for downstream generation tasks because of that bidirectionality) and GPT-2 (easy to apply to generation tasks since it works left-to-right, though bidirectional encoders are known to work much better than unidirectional ones in sequence-to-sequence models), thereby combining the best of both worlds. This is achieved with a simple masking scheme that restricts which words the model can attend to, depending on which objective function is used (e.g. with a unidirectional, left-to-right objective, all tokens to the right of the target word are masked out). Experiments on text summarisation (CNN/DailyMail and Gigaword), question answering (SQuAD, CoQA extractive, and CoQA abstractive), question generation, and GLUE indicate that the proposed pretraining approach largely matches or surpasses the current state of the art. Crucially, the masking approach enables pretraining the two key ingredients of sequence-to-sequence models within a single model: (i) a bidirectional encoder, and (ii) a unidirectional decoder.
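The masking scheme described above can be illustrated with a small sketch. The function below builds a binary self-attention mask (1 = token in the row may attend to the token in the column) for the three objectives; the function name, argument layout, and NumPy implementation are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def unilm_attention_mask(n_src, n_tgt=0, mode="bidirectional"):
    """Sketch of UniLM-style attention masks over a sequence of
    n_src source tokens followed by n_tgt target tokens.
    Entry [i, j] == 1 means token i may attend to token j."""
    n = n_src + n_tgt
    if mode == "unidirectional":
        # Left-to-right LM: each token sees only itself and tokens to its left.
        return np.tril(np.ones((n, n), dtype=int))
    if mode == "bidirectional":
        # BERT-style masked LM: every token may attend to every token.
        return np.ones((n, n), dtype=int)
    if mode == "seq2seq":
        mask = np.zeros((n, n), dtype=int)
        # All tokens may attend to the whole source segment
        # (so the source acts as a bidirectional encoder).
        mask[:, :n_src] = 1
        # Target tokens additionally attend left-to-right within the target
        # (so the target acts as a unidirectional decoder).
        mask[n_src:, n_src:] = np.tril(np.ones((n_tgt, n_tgt), dtype=int))
        return mask
    raise ValueError(f"unknown mode: {mode}")
```

For example, with two source and two target tokens in seq2seq mode, the source rows see only the source, while the target rows see the source plus their own left context, which is exactly the encoder/decoder split the review describes.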


Reviews: Unified Language Model Pre-training for Natural Language Understanding and Generation

Neural Information Processing Systems

This paper presents an alternative training regime for the BERT contextual embedding model that incorporates additional conditioning contexts, such as left-to-right language modelling and sequence transduction. The reviewers agree that the work is well motivated and a reasonable attempt to address some of the issues with the original BERT model. The results are suitably strong, and the paper is therefore likely to be of interest to those working on contextual embedding models, although it is puzzling that a classic language-modelling perplexity evaluation was not included, given that this is one of the objectives the model optimises. The authors' final paper should incorporate the answers to the questions raised by the reviewers.