Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal
arXiv.org Artificial Intelligence
While modern masked language models (LMs) are trained on ever larger corpora, here we explore the effects of down-scaling training to a modestly sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpus has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible, and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
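For readers unfamiliar with the training objective the abstract refers to, the sketch below illustrates the standard BERT-style masked language modeling recipe (mask 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged, with loss computed only on masked positions). This is a minimal, assumed illustration of the generic objective, not the exact configuration used for LTG-BERT; `VOCAB_SIZE`, `MASK_ID`, and `PAD_ID` are hypothetical placeholders.

```python
# Minimal sketch of BERT-style masked language modeling (assumed standard
# 15% / 80-10-10 recipe; not necessarily the exact LTG-BERT configuration).
import torch
import torch.nn.functional as F

VOCAB_SIZE = 30_000      # hypothetical vocabulary size
MASK_ID, PAD_ID = 4, 0   # hypothetical special-token ids


def mask_tokens(input_ids: torch.Tensor, mask_prob: float = 0.15):
    """Return (corrupted_inputs, labels); labels are -100 on non-target positions."""
    labels = input_ids.clone()
    # Select ~15% of non-padding positions as prediction targets.
    probs = torch.full(input_ids.shape, mask_prob)
    probs[input_ids == PAD_ID] = 0.0
    targets = torch.bernoulli(probs).bool()
    labels[~targets] = -100  # ignored by cross_entropy below

    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & targets
    corrupted[to_mask] = MASK_ID
    # 10% of targets -> random token (half of the remaining 20%)
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & targets & ~to_mask
    corrupted[to_random] = torch.randint(VOCAB_SIZE, input_ids.shape)[to_random]
    # Remaining 10% of targets keep their original token.
    return corrupted, labels


# Usage with any encoder that maps token ids to per-token vocabulary logits:
#   logits = model(corrupted)  # shape (batch, seq_len, VOCAB_SIZE)
#   loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100)
```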
May 5, 2023