Efficient Continual Pre-training for Building Domain Specific Large Language Models
Yong Xie, Karan Aggarwal, Aitzaz Ahmad
arXiv.org Artificial Intelligence
Large language models (LLMs) have demonstrated remarkable open-domain capabilities. Traditionally, LLMs tailored to a domain are trained from scratch to excel at domain-specific tasks. In this work, we explore an alternative strategy of continual pre-training as a means to develop domain-specific LLMs. We introduce FinPythia-6.9B, developed through domain-adaptive continual pre-training on the financial domain. The continually pre-trained FinPythia shows consistent improvements on financial tasks over the original foundation model. We further explore simple but effective data selection strategies for continual pre-training. Our data selection strategies outperform vanilla continual pre-training with just 10% of the corpus size and cost, without any degradation on standard open-domain tasks. Our work offers a cost-effective alternative to building domain-specific LLMs from scratch.
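As a rough illustration of domain-adaptive continual pre-training, the sketch below continues causal language modeling of a Pythia checkpoint on a financial text corpus using Hugging Face transformers. The corpus file name, sequence length, and hyperparameters are placeholders for illustration only, not the configuration used for FinPythia-6.9B, and the paper's data selection step is not shown.

```python
# Minimal sketch: continue causal-LM pre-training of a Pythia base model on a
# domain corpus. Dataset path and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/pythia-6.9b"  # base checkpoint to continue training from
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Pythia's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical financial-domain corpus of raw text documents.
raw = load_dataset("text", data_files={"train": "financial_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="finpythia-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=1e-5,   # small LR to limit forgetting of open-domain abilities
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, the paper's efficiency gains come from selecting a small, informative subset of the domain corpus (about 10% in their experiments) before running a loop like this, rather than training on the full crawl.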
Nov-14-2023