Efficient Continual Pre-training for Building Domain Specific Large Language Models
Xie, Yong, Aggarwal, Karan, Ahmad, Aitzaz
arXiv.org Artificial Intelligence
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at domain-specific tasks. In this work, we explore continual pre-training as an alternative strategy for developing domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks. Our work offers a cost-effective alternative to building domain-specific LLMs from scratch.
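The data selection idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a generic per-document relevance score (here a hypothetical keyword-count proxy) and keeps only the top-scoring fraction of the corpus for continual pre-training.

```python
import heapq

def select_top_fraction(corpus, score_fn, fraction=0.10):
    """Keep the highest-scoring fraction of documents for continual pre-training.

    `score_fn` is any per-document relevance score; the paper's actual
    selection criteria may differ from this sketch.
    """
    k = max(1, int(len(corpus) * fraction))
    return heapq.nlargest(k, corpus, key=score_fn)

# Toy example with a hypothetical domain-keyword score as the proxy.
docs = [
    "stocks rallied on strong earnings",
    "the cat sat on the mat",
    "bond yields and equities fell",
]
keywords = {"stocks", "earnings", "bond", "yields", "equities"}
score = lambda d: sum(w in keywords for w in d.split())

selected = select_top_fraction(docs, score, fraction=0.67)
```

In practice the scoring function would reflect domain relevance and data quality (e.g., similarity to in-domain task data), so that a small, well-chosen subset matches or exceeds training on the full corpus.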
Nov-14-2023