Typhoon: Thai Large Language Models
Pipatanakul, Kunat; Jirabovonvisut, Phatrasek; Manakul, Potsawee; Sripaisarnmongkol, Sittipong; Patomwong, Ruangsak; Chokchainant, Pathomporn; Tharnpipitchai, Kasima
arXiv.org Artificial Intelligence
Typhoon is a series of Thai large language models (LLMs) developed specifically for the Thai language. This technical report presents challenges and insights in developing Thai LLMs, including data preparation, pretraining, instruction-tuning, and evaluation. Because limited pretraining data is one of the main challenges for low-resource languages, we apply continual training to transfer existing world knowledge from a strong LLM. To evaluate the Thai knowledge encapsulated in each model from the pretraining stage, we develop ThaiExam, a benchmark based on examinations for high-school students and investment professionals in Thailand. In addition, we fine-tune Typhoon to follow Thai instructions, and we evaluate instruction-tuned models on Thai instruction datasets as well as translation, summarization, and question-answering tasks. Experimental results on a suite of Thai benchmarks show that Typhoon outperforms all open-source Thai language models, and its performance is on par with GPT-3.5 in Thai while having only 7 billion parameters and being 2.62 times more efficient in tokenizing Thai text.
Dec-21-2023
- Country:
- North America
- United States > Pennsylvania
- Philadelphia County > Philadelphia (0.04)
- Canada > Ontario
- Toronto (0.04)
- Europe
- Spain > Catalonia
- Barcelona Province > Barcelona (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Thailand (0.35)
- Southeast Asia (0.04)
- Singapore (0.04)
- Indonesia > Bali (0.04)
- Middle East
- Jordan (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- Genre:
- Research Report (0.82)
- Industry:
- Education > Educational Setting > K-12 Education > Secondary School (0.54)
- Technology: