Shiu, Da-Shan
The Breeze 2 Herd of Models: Traditional Chinese LLMs Based on Llama with Vision-Aware and Function-Calling Capabilities
MediaTek Research: Hsu, Chan-Jan, Liu, Chia-Sheng, Chen, Meng-Hsi, Chen, Muxi, Hsu, Po-Chun, Chen, Yi-Chang, Shiu, Da-Shan
Llama-Breeze2 (hereinafter referred to as Breeze2) is a suite of advanced multi-modal language models, available in 3B and 8B parameter configurations, specifically designed to enhance Traditional Chinese language representation. Building upon the Llama 3.2 model family, we continue the pre-training of Breeze2 on an extensive corpus to strengthen its coverage of the linguistic and cultural heritage of Traditional Chinese. In addition to language modeling capabilities, we significantly augment the models with function-calling and vision understanding capabilities. At the time of this publication, to the best of our knowledge and absent reasoning-inducing prompts, the Breeze2 models are the strongest performers in Traditional Chinese function calling and image understanding within their size classes. The effectiveness of Breeze2 is benchmarked across various tasks, including Taiwan general knowledge, instruction following, long context, function calling, and vision understanding. We publicly release all Breeze2 models under the Llama 3.2 Community License. We also showcase the model's capabilities on a mobile platform through a mobile application, which we open source as well.
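As a minimal sketch of how one might exercise the function-calling capability with Hugging Face transformers, the following Python snippet is illustrative only: the repository id, the tool-description format, and the assumption that the released checkpoints load through the standard AutoModelForCausalLM path (possibly requiring trust_remote_code) are ours, not a documented interface.

    # Illustrative only: repo id, tool schema, and loading path are assumptions.
    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "MediaTek-Research/Llama-Breeze2-3B-Instruct"  # assumed repo id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
    )

    # Hypothetical tool description passed through the system prompt; the actual
    # function-calling format expected by Breeze2 may differ.
    tools = [{
        "name": "get_weather",
        "description": "Query the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    messages = [
        {"role": "system",
         "content": "You may call these functions:\n" + json.dumps(tools, ensure_ascii=False)},
        {"role": "user", "content": "台北現在天氣如何？"},
    ]

    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=128)
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))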
BreezyVoice: Adapting TTS for Taiwanese Mandarin with Enhanced Polyphone Disambiguation -- Challenges and Insights
Hsu, Chan-Jan, Lin, Yi-Cheng, Lin, Chia-Chun, Chen, Wei-Chih, Chung, Ho Lam, Li, Chen-An, Chen, Yi-Chang, Yu, Chien-Yu, Lee, Ming-Ji, Chen, Chien-Cheng, Huang, Ru-Heng, Lee, Hung-yi, Shiu, Da-Shan
We present BreezyVoice, a Text-to-Speech (TTS) system specifically adapted for Taiwanese Mandarin, highlighting phonetic control abilities to address the unique challenges of polyphone disambiguation in the language. Building upon CosyVoice, we incorporate an $S^{3}$ tokenizer, a large language model (LLM), an optimal-transport conditional flow matching model (OT-CFM), and a grapheme-to-phoneme prediction model, to generate realistic speech that closely mimics human utterances. Our evaluation demonstrates BreezyVoice's superior performance in both general and code-switching contexts, highlighting its robustness and effectiveness in generating high-fidelity speech. Additionally, we address the challenges of generalizability in modeling long-tail speakers and polyphone disambiguation. Our approach significantly enhances performance and offers valuable insights into the workings of neural codec TTS systems.
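To make the pipeline described above concrete, the following sketch traces the data flow (input text, grapheme-to-phoneme tagging with optional phonetic hints, LLM prediction of discrete $S^{3}$ speech tokens, OT-CFM decoding to audio) using hypothetical placeholder functions; none of the names correspond to the actual CosyVoice or BreezyVoice APIs.

    # Placeholder pipeline; all classes and functions are illustrative stand-ins.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SynthesisRequest:
        text: str                 # Taiwanese Mandarin input text
        phoneme_hints: List[str]  # optional bopomofo hints for polyphonic characters

    def g2p(req: SynthesisRequest) -> List[str]:
        """Grapheme-to-phoneme step; explicit hints override ambiguous readings."""
        # A real model disambiguates polyphones such as 行 (xíng vs. háng).
        return req.phoneme_hints or list(req.text)

    def lm_predict_speech_tokens(phonemes: List[str]) -> List[int]:
        """The LLM autoregressively predicts discrete S^3 speech tokens."""
        return [hash(p) % 4096 for p in phonemes]  # dummy token ids

    def otcfm_decode(speech_tokens: List[int]) -> List[float]:
        """The OT-CFM model maps speech tokens to an acoustic signal."""
        return [t / 4096.0 for t in speech_tokens]  # dummy waveform samples

    req = SynthesisRequest(text="銀行在哪裡", phoneme_hints=[])
    waveform = otcfm_decode(lm_predict_speech_tokens(g2p(req)))
    print(len(waveform), "samples")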
FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web
Lin, Cheng-Wei, Hsieh, Wan-Hsuan, Guan, Kai-Xin, Hsu, Chan-Jan, Kuo, Chia-Chen, Lai, Chuan-Lin, Chung, Chung-Wei, Wang, Ming-Jen, Shiu, Da-Shan
The quality and size of a pretraining dataset significantly influence the performance of large language models (LLMs). While there have been numerous efforts to curate such datasets for English users, there is a relative lack of similar initiatives for Traditional Chinese. Building upon the foundation of FineWeb, we introduce FineWeb-zhtw, a dataset tailored specifically for Traditional Chinese users. We designed multiple stages of meticulously crafted filters to cater to the linguistic differences between English and Traditional Chinese and to ensure comprehensiveness and quality. We evaluated effectiveness by querying dataset samples against three main objectives. Our code and datasets are publicly available.
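The following is a minimal sketch of the kind of multi-stage filtering such a pipeline involves: script identification to separate Traditional from Simplified Chinese, simple quality heuristics, and deduplication. The character lists, thresholds, and heuristics are illustrative assumptions and are not the filters used in FineWeb-zhtw.

    # Illustrative filters; character lists and thresholds are assumptions.
    SIMPLIFIED_ONLY = set("国门广东这来对时长")    # forms that exist only in Simplified
    TRADITIONAL_ONLY = set("國門廣東這來對時長")   # their Traditional counterparts

    def is_traditional_chinese(text: str, min_ratio: float = 0.9) -> bool:
        trad = sum(ch in TRADITIONAL_ONLY for ch in text)
        simp = sum(ch in SIMPLIFIED_ONLY for ch in text)
        return trad + simp > 0 and trad / (trad + simp) >= min_ratio

    def passes_quality(text: str) -> bool:
        # Toy heuristics: minimum length and a cap on symbol noise.
        if len(text) < 200:
            return False
        symbols = sum(not ch.isalnum() and not ch.isspace() for ch in text)
        return symbols / len(text) < 0.3

    def filter_documents(docs):
        seen = set()
        for doc in docs:
            key = hash(doc)
            if key in seen:              # exact deduplication
                continue
            seen.add(key)
            if is_traditional_chinese(doc) and passes_quality(doc):
                yield doc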
Breeze-7B Technical Report
Hsu, Chan-Jan, Liu, Chang-Le, Liao, Feng-Ting, Hsu, Po-Chun, Chen, Yi-Chang, Shiu, Da-Shan
Breeze-7B is an open-source language model based on Mistral-7B, designed to address the need for improved language comprehension and chatbot-oriented capabilities in Traditional Chinese. This technical report provides an overview of the additional pretraining, finetuning, and evaluation stages of the Breeze-7B model. The Breeze-7B family of base and chat models exhibits strong performance on language comprehension and chatbot-oriented tasks, ranking at the top of several benchmarks among models of comparable complexity.
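For illustration, the snippet below shows the style of log-likelihood scoring commonly used to evaluate multiple-choice language-comprehension benchmarks with such a model; the repository id is an assumption, and the report does not prescribe this exact evaluation code.

    # Illustrative multiple-choice scoring; the model id is an assumption.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "MediaTek-Research/Breeze-7B-Instruct-v1_0"  # assumed repo id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model.eval()

    def option_logprob(question: str, option: str) -> float:
        """Sum of log-probabilities the model assigns to `option` given `question`."""
        prompt_ids = tokenizer(question, return_tensors="pt").input_ids.to(model.device)
        full_ids = tokenizer(question + option, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            logits = model(full_ids).logits
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_lp[:, prompt_ids.shape[1] - 1:].sum().item()  # score only the option tokens

    question = "台灣最高的山是？答案："
    options = ["玉山", "阿里山", "合歡山"]
    print(max(options, key=lambda o: option_logprob(question, o)))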
Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Ennen, Philipp, Hsu, Po-Chun, Hsu, Chan-Jan, Liu, Chang-Le, Wu, Yen-Chen, Liao, Yin-Hsiang, Lin, Chin-Tung, Shiu, Da-Shan, Ma, Wei-Yun
In this paper we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models presented by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM with an additional 7.4 billion tokens of Traditional Chinese and English text covering a variety of domains such as news articles, books, encyclopedias, educational materials, and spoken language. To demonstrate the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmark scenarios. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.
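As a rough sketch of what continued pre-training on additional Traditional Chinese text looks like in practice, the snippet below uses the Hugging Face Trainer on a small BLOOM checkpoint; the base model, corpus file, and hyperparameters are stand-ins for illustration and are not those used to train BLOOM-zh.

    # Illustrative continued pre-training; model, data, and hyperparameters are assumptions.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    BASE_MODEL = "bigscience/bloom-1b1"   # small stand-in for illustration
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

    # Hypothetical corpus of Traditional Chinese documents, one per line.
    raw = load_dataset("text", data_files={"train": "zhtw_corpus.txt"})["train"]

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=1024)

    train_ds = raw.map(tokenize, batched=True, remove_columns=["text"])
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

    args = TrainingArguments(
        output_dir="bloom-zh-continued",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        num_train_epochs=1,
        bf16=True,
    )
    Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collator).train()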