"What do data-rich models know that models with less pre-training data do not?" The performance of language models is determined mostly by the amount of training data, quality of the training data and choice of modelling technique for estimation. Pretrained language models like BERT use massive datasets on the order of tens or even hundreds of billions of words to learn linguistic features and world knowledge, and they can be fine-tuned to achieve good performance on many downstream tasks. General-purpose pre-trained language models achieve strong performance on NLU tasks through pretraining on billions of words. But what exact knowledge, ask the researchers at NYU, do these models learn from large scale pretraining that they cannot learn from less data? To understand the relation between massiveness of data and learning in language models, the researchers adopted four probing methods -- classifier probing, information-theoretic probing, unsupervised relative acceptability judgment, and fine-tuning on NLU tasks and plotted to learn curves (shown above) for the four probing methods.
Nov-20-2020, 05:15:40 GMT