Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training

Pan, Leiyu, Xiong, Bojian, Yang, Lei, Jin, Renren, Zhang, Shaowei, Chen, Yue, Shi, Ling, Zhou, Jiang, Wu, Junru, Wang, Zhen, Peng, Jianxiang, Xiao, Juesi, Dong, Tianyu, Han, Zhuowen, Chen, Zhuo, Ren, Yuqi, Xiong, Deyi

Jul-29-2025–arXiv.org Artificial Intelligence

Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. Using the curated data, we apply continual pre-training and post-training to a multilingual base model to enhance its generative capabilities in Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks and complement them with existing public benchmarks. Experimental results demonstrate that our model consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.

large language model, machine learning, natural language


Country:
- Asia (1.00)

Genre:
- Research Report > New Finding (0.48)

Industry:
- Education > Educational Setting (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Machine Translation (1.00)
    - Large Language Model (1.00)
