The Effects of In-domain Corpus Size on pre-training BERT

Dec-15-2022–arXiv.org Artificial Intelligence

Pre-training Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its variants (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) has proven to be an excellent strategy and has achieved state-of-the-art results on many downstream natural language processing (NLP) tasks. Most models focused their pre-training efforts on general-domain text. For example, the original BERT model was trained on Wikipedia and the BookCorpus (Zhu et al., 2015). Many subsequent efforts focused on adding more text to the pre-training process to create even larger models, with the intent of improving model performance (Liu et al., 2019; Raffel et al., 2019). However, recent works have ...

Web scraping is one oft-cited method used to gather publicly available documents to enlarge one's in-domain training corpus. For example, the LEGAL-BERT authors (Chalkidis et al., 2020) scraped publicly available legal text from six different sources to reach a total corpus size of 12 GB. Nevertheless, this data collection process is laborious and time-consuming, and it could discourage researchers from conducting such experiments for fear of being unable to collect enough data. On the other hand, it would also be a waste of resources if, after all the data is collected, it turns out that the data is still not enough for pre-training and the model ends up performing poorly.
