Automatic Document Selection for Efficient Encoder Pretraining

Feng, Yukun, Xia, Patrick, Van Durme, Benjamin, Sedoc, João

Oct-25-2022–arXiv.org Artificial Intelligence

Building pretrained language models is considered expensive and data-intensive, but must we increase dataset size to achieve better performance? We propose an alternative to larger training sets: automatically identifying smaller yet domain-representative subsets. We extend Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus. As an example, we treat the OntoNotes corpus as the target domain and pretrain a RoBERTa-like encoder on a cynically selected subset of the Pile. On both perplexity and several downstream tasks in the target domain, the resulting encoder consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost, validating the recipe of automatic document selection for LM pretraining.
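
To make the idea of selecting documents conditioned on a target domain corpus concrete, here is a minimal sketch in Python. It is not the authors' algorithm or implementation: it assumes a toy add-one-smoothed unigram model, whitespace tokenization, and a brute-force greedy loop that recomputes cross-entropy for every candidate. The function names `cross_entropy` and `greedy_select` and the example documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cross_entropy(target_counts, selected_counts, vocab_size, smoothing=1.0):
    """Cross-entropy (nats per token) of the target corpus under an
    add-`smoothing` unigram model estimated from the selected text."""
    total_selected = sum(selected_counts.values())
    total_target = sum(target_counts.values())
    denom = total_selected + smoothing * vocab_size
    h = 0.0
    for word, count in target_counts.items():
        prob = (selected_counts.get(word, 0) + smoothing) / denom
        h -= count * math.log(prob)
    return h / total_target


def greedy_select(target_docs, pool_docs, budget):
    """Greedily pick up to `budget` pool documents; at each step take the one
    that gives the lowest target cross-entropy when added to the selection.
    Simplifying assumption: brute-force rescoring, fine only for toy inputs."""
    target_counts = Counter(w for doc in target_docs for w in doc.split())
    vocab = set(target_counts)
    for doc in pool_docs:
        vocab.update(doc.split())

    selected_counts = Counter()
    remaining = list(range(len(pool_docs)))
    chosen = []
    for _ in range(min(budget, len(pool_docs))):
        best_idx, best_h = None, math.inf
        for idx in remaining:
            trial = selected_counts + Counter(pool_docs[idx].split())
            h = cross_entropy(target_counts, trial, len(vocab))
            if h < best_h:
                best_idx, best_h = idx, h
        chosen.append(best_idx)
        remaining.remove(best_idx)
        selected_counts.update(pool_docs[best_idx].split())
    return chosen


# Toy data (hypothetical): a small "target domain" and a mixed pool.
target = ["the court ruled on the appeal", "the judge ruled"]
pool = [
    "the judge ruled on the appeal",
    "bananas are yellow fruit",
    "the court heard the appeal",
]
selected = greedy_select(target, pool, budget=2)
print(selected)  # indices of the chosen pool documents
```

On this toy input the two pool documents that share vocabulary with the target are chosen and the off-domain one is left out. A practical selector would need incremental scoring rather than rescoring the whole pool at every step.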

machine learning, natural language, selection, (19 more...)

Country:
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Minnesota > Hennepin County
    - Minneapolis (0.14)
- Europe
  - Italy (0.04)
  - Czechia > Prague (0.04)
  - Iceland > Capital Region
    - Reykjavik (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Sweden > Uppsala County
    - Uppsala (0.04)
  - Spain > Catalonia
    - Barcelona Province > Barcelona (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
  - United Kingdom > Scotland
    - City of Edinburgh > Edinburgh (0.04)
- Asia
  - South Korea (0.04)
  - China > Hong Kong (0.04)
  - Middle East > Qatar
    - Ad-Dawhah > Doha (0.04)

Genre:
- Research Report (0.82)

Industry:
- Law (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Text Processing (0.69)
    - Machine Translation (0.46)
