The Effects of In-domain Corpus Size on pre-training BERT

Sanchez, Chris, Zhang, Zheyuan

arXiv.org Artificial Intelligence 

Pre-training Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its variants (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) has proven to be an excellent strategy and achieved state-of-the-art results on many downstream natural language processing (NLP) tasks. Most models focused their pre-training efforts on general domain text. For example, the original BERT model was trained on Wikipedia and the BookCorpus (Zhu et al., 2015). Many subsequent efforts focused on adding text to the pre-training process to create even larger models with the intent of improving model performance (Liu et al., 2019; Raffel et al., 2019). However, recent works have [...]

Web scraping is one oft-cited method used to gather publicly available documents to increase one's in-domain training corpora. For example, the authors of LEGAL-BERT (Chalkidis et al., 2020) scraped publicly available legal text from six different sources to achieve a total corpus size of 12 GB. Nevertheless, this data collection process is laborious and time-consuming, and could discourage researchers from conducting such experiments for fear of being unable to collect enough data. On the other hand, it would also be a waste of resources if, after all the data is collected, it turns out the data is still not enough for pre-training and the model ends up having poor performance.
