Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yen-Chen Wu, Yin-Hsiang Liao, Chin-Tung Lin, Da-Shan Shiu, Wei-Yun Ma

arXiv.org Artificial Intelligence 

In this paper, we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM with an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, and educational materials, as well as spoken language. To demonstrate the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.
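The continued pre-training described in the abstract can be illustrated with a short script. The following is a minimal sketch only, assuming the Hugging Face Transformers stack, the small bigscience/bloom-1b1 checkpoint, and a placeholder plain-text corpus file (corpus_zh_en.txt); the paper's actual training setup, data pipeline, and hyperparameters are not specified here, so all values below are illustrative assumptions.

```python
# Minimal sketch: resume causal-LM pre-training from a released BLOOM checkpoint.
# Checkpoint, corpus path, and hyperparameters are assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bigscience/bloom-1b1"  # assumed small variant for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Load a mixed Traditional Chinese / English text corpus (path is a placeholder).
dataset = load_dataset("text", data_files={"train": "corpus_zh_en.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Causal-LM collator: with mlm=False, labels are copies of the input ids and
# the model applies the next-token shift internally.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="bloom-zh-continued",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,   # assumed; typically lower than from-scratch pre-training
    num_train_epochs=1,
    bf16=True,
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

In a continued pre-training setup like this, the key design choice is to start the optimizer from the released weights with a conservative learning rate, so the new Traditional Chinese data extends the model's coverage without erasing the English capability it already has.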
