Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
Shahriar Golchin, Mihai Surdeanu, Nazgol Tavabi, Ata Kiapour
– arXiv.org Artificial Intelligence
We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
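As a rough illustration of the pipeline described in the abstract, the sketch below first collects in-domain keywords with KeyBERT and then masks matching word pieces when preparing MLM inputs with a Hugging Face BERT tokenizer. The keyword count, the 15% masking budget, the surface-form matching heuristic, and the random top-up fallback are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of keyword-guided masking for in-domain MLM pre-training,
# assuming KeyBERT and Hugging Face transformers. Thresholds and the fallback
# step are illustrative choices, not the paper's exact settings.
import random

import torch
from keybert import KeyBERT
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-large-uncased")
kw_model = KeyBERT()  # default sentence-transformers backbone


def extract_domain_keywords(docs, top_n=20):
    """Build a set of in-domain keywords over the target-domain corpus."""
    keywords = set()
    for doc in docs:
        for word, _score in kw_model.extract_keywords(
            doc, keyphrase_ngram_range=(1, 1), stop_words="english", top_n=top_n
        ):
            keywords.add(word.lower())
    return keywords


def mask_in_domain_keywords(text, keywords, mask_prob=0.15):
    """Mask word pieces whose surface form matches an in-domain keyword;
    top up with random tokens so roughly `mask_prob` of the input is masked."""
    enc = tokenizer(text, truncation=True)
    input_ids = list(enc["input_ids"])
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    labels = [-100] * len(input_ids)  # -100 = ignored by the MLM loss

    specials = set(tokenizer.all_special_tokens)
    budget = max(1, int(mask_prob * len(input_ids)))

    # Prefer positions that belong to in-domain keywords.
    keyword_pos = [
        i for i, tok in enumerate(tokens)
        if tok not in specials and tok.lstrip("#") in keywords
    ]
    random.shuffle(keyword_pos)
    chosen = keyword_pos[:budget]

    # Fall back to random masking if too few keyword tokens are present.
    if len(chosen) < budget:
        rest = [
            i for i, tok in enumerate(tokens)
            if i not in set(chosen) and tok not in specials
        ]
        chosen += random.sample(rest, min(budget - len(chosen), len(rest)))

    for i in chosen:
        labels[i] = input_ids[i]
        input_ids[i] = tokenizer.mask_token_id
    return torch.tensor(input_ids), torch.tensor(labels)
```

In such a setup, the keyword set would be computed once over the in-domain corpus, and a function like `mask_in_domain_keywords` would stand in for the usual random-masking collator during the intermediate pre-training stage, before task-specific fine-tuning.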
Jul-14-2023
- Country:
  - North America > United States
    - Arizona > Pima County > Tucson (0.14)
    - Minnesota > Hennepin County > Minneapolis (0.14)
- Genre:
  - Research Report
    - Experimental Study (0.69)
    - New Finding (1.00)
- Industry:
  - Health & Medicine (1.00)