Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Ennen, Philipp, Hsu, Po-Chun, Hsu, Chan-Jan, Liu, Chang-Le, Wu, Yen-Chen, Liao, Yin-Hsiang, Lin, Chin-Tung, Shiu, Da-Shan, Ma, Wei-Yun
In this paper, we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM with an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, and educational materials, as well as spoken language. To assess the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.
arXiv.org Artificial Intelligence
Jun-23-2023
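The continued pre-training the abstract describes can be illustrated with a minimal sketch using the HuggingFace `transformers` Trainer with a causal-LM objective. The checkpoint `bigscience/bloom-1b1` stands in for whichever released BLOOM model one starts from, the corpus file `zh_tw_en_corpus.txt` is a placeholder, and all hyperparameters are illustrative assumptions rather than the paper's actual setup.

```python
# Minimal sketch of continued pre-training on a released BLOOM checkpoint.
# Checkpoint size, corpus file, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bigscience/bloom-1b1"  # any released BLOOM size works here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Hypothetical mixed Traditional-Chinese/English corpus; swap in real data.
raw = load_dataset("text", data_files={"train": "zh_tw_en_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-zh-continued",
        per_device_train_batch_size=4,
        learning_rate=1e-5,  # a small LR helps preserve existing capability
        num_train_epochs=1,
    ),
    train_dataset=train_ds,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```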