Efficient Continual Pre-training for Building Domain Specific Large Language Models
Xie, Yong, Aggarwal, Karan, Ahmad, Aitzaz
arXiv.org Artificial Intelligence
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored for a domain are trained from scratch to excel at domain-specific tasks. In this work, we explore continual pre-training as an alternative strategy for developing domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on open-domain standard tasks. Our work offers a cost-effective alternative to building domain-specific LLMs from scratch.
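The data selection idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes a generic per-document relevance score (here a hypothetical keyword-count proxy) and keeps only the top-scoring fraction of the corpus for continual pre-training.

```python
import heapq

def select_top_fraction(corpus, score_fn, fraction=0.10):
    """Keep the highest-scoring fraction of documents for continual pre-training.

    `score_fn` is any per-document relevance score; the paper's actual
    selection criteria may differ from this sketch.
    """
    k = max(1, int(len(corpus) * fraction))
    return heapq.nlargest(k, corpus, key=score_fn)

# Toy example with a hypothetical domain-keyword score as the proxy.
docs = [
    "stocks rallied on strong earnings",
    "the cat sat on the mat",
    "bond yields and equities fell",
]
keywords = {"stocks", "earnings", "bond", "yields", "equities"}
score = lambda d: sum(w in keywords for w in d.split())

selected = select_top_fraction(docs, score, fraction=0.67)
```

In practice the scoring function would reflect domain relevance and data quality (e.g., similarity to in-domain task data), so that a small, well-chosen subset matches or exceeds training on the full corpus.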
Nov-14-2023