Efficient Continual Pre-training of LLMs for Low-resource Languages
Arijit Nag, Soumen Chakrabarti, Animesh Mukherjee, Niloy Ganguly
– arXiv.org Artificial Intelligence
Open-source Large Language Models (OsLLMs) propel the democratization of natural language research by offering the flexibility to augment or update model parameters for performance improvement. Nevertheless, like proprietary LLMs, OsLLMs perform worse on low-resource languages (LRLs) than on high-resource languages (HRLs), owing to smaller amounts of training data and underrepresented vocabulary. At the same time, continual pre-training (CPT) with large amounts of language-specific data is a costly proposition in terms of data acquisition and computational resources. Our goal is to drastically reduce CPT cost. To that end, we first develop a new algorithm to select a subset of texts from a larger corpus, and show the effectiveness of our technique using very little CPT data. In search of further improvement, we design a new algorithm to select tokens to add to the LLM vocabulary. We experiment with the recent Llama-3 model and nine Indian languages with diverse scripts and varying degrees of resource availability. For evaluation, we use IndicGenBench, a generation-task benchmark for Indic languages. We experiment with various CPT corpora and augmented vocabulary sizes, and offer insights across language families.
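The abstract does not spell out the subset-selection criterion, so the following is only a minimal sketch of one plausible approach: greedily picking documents that add the most unseen word types to the selected pool. The function name `select_cpt_subset` and the coverage heuristic are illustrative assumptions, not the paper's actual algorithm.

```python
from collections import Counter
from typing import List


def select_cpt_subset(corpus: List[str], budget: int) -> List[str]:
    """Hypothetical sketch: greedily select up to `budget` documents that
    maximize coverage of new word types. The paper's real selection
    criterion is not given in the abstract."""
    covered: set = set()
    selected: List[str] = []
    remaining = list(corpus)
    for _ in range(min(budget, len(remaining))):
        # Score each candidate by how many previously unseen word types
        # it would contribute to the CPT subset.
        best_doc = max(remaining, key=lambda d: len(set(d.split()) - covered))
        selected.append(best_doc)
        covered |= set(best_doc.split())
        remaining.remove(best_doc)
    return selected
```

Greedy coverage is a common baseline for data selection; whatever criterion the paper actually uses, the output is the same kind of object, a small CPT corpus drawn from a much larger pool.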
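Vocabulary augmentation can likewise be sketched with the standard Hugging Face API (`tokenizer.add_tokens` and `model.resize_token_embeddings`). The frequency-based token-selection rule below, the `k` parameter, and the specific checkpoint id are assumptions for illustration; the paper proposes its own token-selection algorithm, which the abstract does not detail.

```python
from collections import Counter
from typing import List

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the abstract says only that Llama-3 is used.
model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def augment_vocabulary(lrl_texts: List[str], k: int = 1000) -> int:
    """Hypothetical sketch: add the k most frequent LRL words that the
    current tokenizer over-fragments (splits into multiple pieces)."""
    counts = Counter(w for t in lrl_texts for w in t.split())
    new_tokens = [w for w, _ in counts.most_common()
                  if len(tokenizer.tokenize(w)) > 1][:k]
    added = tokenizer.add_tokens(new_tokens)
    # Grow the embedding matrix to match the enlarged vocabulary; the new
    # rows are then trained during continual pre-training.
    model.resize_token_embeddings(len(tokenizer))
    return added
```

The over-fragmentation test targets exactly the "underrepresented vocabulary" problem the abstract names: words that the stock tokenizer breaks into many subword pieces are the ones a dedicated embedding row helps most.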
Dec-13-2024