Sailor: Open Language Models for South-East Asia

Dou, Longxu, Liu, Qian, Zeng, Guangtao, Guo, Jia, Zhou, Jiahui, Lu, Wei, Lin, Min

Apr-4-2024–arXiv.org Artificial Intelligence

We present Sailor, a family of open language models ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a strong language model for multilingual use cases. Starting from Qwen1.5, Sailor models are further trained on 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks (commonsense reasoning, question answering, reading comprehension, and examination) indicate that Sailor models perform strongly across benchmarks. Embracing the open-source spirit, we share our insights through this report to spark wider interest in developing large language models for multilingual use cases.

artificial intelligence, large language model, natural language

Country:
- Asia (1.00)
- Europe > United Kingdom
  - Scotland (0.14)
- North America > United States
  - Texas (0.14)

Genre:
- Research Report (1.00)

Industry:
- Education (0.34)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
