Uyghur


CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

Xu, Guixian, Su, Zeli, Zhang, Ziyin, Liu, Jianing, Han, Xu, Zhang, Ting, Dong, Yushuang

arXiv.org Artificial Intelligence

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.


CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE (Chinese, Uyghur, Tibetan, English), a dataset consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validated that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for the Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
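
As a toy illustration of the parallelism contrast studied here, the following Python sketch formats the same sentence pairs either as aligned translation examples or as shuffled monolingual streams; the pair template and language labels are assumptions for illustration, not CUTE's actual layout.

import random

def parallel_examples(pairs, src_lang="Chinese", tgt_lang="Uyghur"):
    # One training example per aligned sentence pair.
    return [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in pairs]

def nonparallel_examples(pairs, seed=0):
    # The same sentences, shuffled per language so the alignment is lost.
    src_side = [src for src, _ in pairs]
    tgt_side = [tgt for _, tgt in pairs]
    rng = random.Random(seed)
    rng.shuffle(src_side)
    rng.shuffle(tgt_side)
    return src_side + tgt_side

pairs = [("你好。", "ياخشىمۇسىز؟"), ("谢谢。", "رەھمەت.")]
print(parallel_examples(pairs)[0])   # one aligned example
print(nonparallel_examples(pairs))   # unaligned monolingual streams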


Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan, Zhao, Xiaobing

arXiv.org Artificial Intelligence

As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to extend effectively to low-resource languages, particularly those using non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, no comprehensive framework currently exists for integrating transliteration into LLM training and deployment. Taking a pragmatic approach, this paper combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: reduces storage requirements for low-resource language content, achieving up to a 50% reduction in file size and a 50-80% reduction in token count. 2) Accuracy: guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: the framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model's capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at https://github.com/CMLI-NLP/HuffmanTranslit.
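
To make the core idea concrete, here is a minimal Python sketch of reversible Huffman-based transliteration. It is a simplified illustration under assumed choices (a lowercase Latin target alphabet, a per-text code table), not the authors' released framework: frequent source characters receive short Latin codewords, and decoding is lossless because Huffman codes are prefix-free.

import heapq
from collections import Counter
from itertools import count

LATIN = "abcdefghijklmnopqrstuvwxyz"  # assumed target alphabet

def build_codes(text, arity=len(LATIN)):
    # Huffman coding over a 26-ary (Latin-letter) alphabet: frequent
    # source characters receive shorter Latin codewords.
    freqs = Counter(text)
    tiebreak = count()  # unique ints so the heap never compares payloads
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    # Pad with zero-frequency dummies so every merge can take `arity` nodes.
    while len(heap) > 1 and (len(heap) - 1) % (arity - 1) != 0:
        heapq.heappush(heap, (0, next(tiebreak), None))
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(arity)]
        total = sum(f for f, _, _ in children)
        heapq.heappush(heap, (total, next(tiebreak), children))
    codes = {}
    def walk(node, prefix):
        _, _, payload = node
        if isinstance(payload, list):          # internal node
            for i, child in enumerate(payload):
                walk(child, prefix + LATIN[i])
        elif payload is not None:              # real character leaf
            codes[payload] = prefix or LATIN[0]
    walk(heap[0], "")
    return codes

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(latin, codes):
    # Prefix-free codes make left-to-right decoding unambiguous.
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for ch in latin:
        buf += ch
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

sample = "ئۇيغۇرچە تېكىست"  # Uyghur (Arabic-script) sample text
codes = build_codes(sample)
translit = encode(sample, codes)
assert decode(translit, codes) == sample  # 100% lossless round-trip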


TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Isbarov, Jafar, Akhundjanova, Arofat, Hajili, Mammad, Huseynova, Kavsar, Gaynullin, Dmitry, Rzayev, Anar, Tursun, Osman, Saetov, Ilshat, Kharisov, Rinat, Belginova, Saule, Kenbayeva, Ariana, Alisheva, Amina, Turdubaeva, Aizirek, Köksal, Abdullatif, Rustamov, Samir, Ataman, Duygu

arXiv.org Artificial Intelligence

Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high-quality native languages is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts have focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native-language MMLU benchmarks, especially in the under-represented Turkic language family with its distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic-language MMLU. TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school-level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.
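
For readers unfamiliar with how such benchmarks are scored, the following Python sketch shows a generic MMLU-style multiple-choice evaluation loop; ask_model and the item fields are hypothetical placeholders, not TUMLU's actual schema or harness.

CHOICES = "ABCD"

def format_prompt(item):
    options = "\n".join(f"{CHOICES[i]}. {opt}" for i, opt in enumerate(item["options"]))
    return (f"Question ({item['subject']}, {item['language']}):\n"
            f"{item['question']}\n{options}\nAnswer with a single letter:")

def evaluate(items, ask_model):
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item))
        # Take the first A-D letter in the reply as the predicted choice.
        pred = next((c for c in reply if c in CHOICES), None)
        correct += (pred == item["answer"])
    return correct / len(items)

demo = [{"subject": "history", "language": "Uzbek",
         "question": "Which city was the capital of the Timurid Empire?",
         "options": ["Samarkand", "Bukhara", "Tashkent", "Khiva"],
         "answer": "A"}]
print(evaluate(demo, lambda prompt: "A"))  # mock model; prints 1.0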


Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Su, Zeli, Zhang, Ziyin, Xu, Guixian, Liu, Jianing, Han, Xu, Zhang, Ting, Dong, Yushuang

arXiv.org Artificial Intelligence

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, leaving many of the world's languages without text generation models. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM and demonstrate its superior performance on various downstream tasks, even when compared with much larger models.
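
The weight-reuse idea can be sketched with the Hugging Face warm-start pattern (Rothe et al., 2020), in which a decoder is initialized from the same encoder checkpoint. This is a minimal sketch of the general principle only, not the authors' XLM-SWCM implementation.

from transformers import EncoderDecoderModel, AutoTokenizer

# Both encoder and decoder start from the same multilingual encoder
# checkpoint, so the decoder inherits the encoder's learned semantic
# space; the cross-attention layers are the only randomly initialized
# parameters.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Required generation settings for the warm-started seq2seq model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# From here one would fine-tune on generation data in the target
# low-resource languages with a standard seq2seq training loop.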


Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

Xu, Tianyi, Huang, Kaixun, Guo, Pengcheng, Zhou, Yu, Huang, Longtao, Xue, Hui, Xie, Lei

arXiv.org Artificial Intelligence

Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally expensive and faces catastrophic forgetting problems. Addressing these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, while also preserving the established performance on the original languages. Specifically, we first compare various LoRA-based methods to assess their vulnerability to forgetting. To mitigate this issue, we propose leveraging the LoRA parameters from the original model for approximate orthogonal gradient descent on the new samples. We also introduce a learnable rank coefficient to allocate trainable parameters for more efficient training. Our experiments with a Chinese Whisper model (for Uyghur and Tibetan) yield better results with a more compact parameter set.
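
As background, the standard LoRA setup that such methods build on can be reproduced in a few lines with the PEFT library; the paper's orthogonal-gradient and learnable-rank extensions are not shown here.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Low-rank adapters on the attention projections; the rank r controls
# how many trainable parameters are added per layer.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable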


MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Zhang, Chen, Tao, Mingxu, Huang, Quzhe, Lin, Jiuheng, Chen, Zhibin, Feng, Yansong

arXiv.org Artificial Intelligence

Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., the Kazakh Arabic script and the traditional Mongolian script, respectively, which have long been neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.
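
In the spirit of this quality-centric filtering (the project's actual pipeline is more involved), a simple script-ratio filter over Unicode block ranges can drop lines contaminated by other writing systems; the ranges shown are standard Unicode blocks, while the threshold is an assumption.

SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],  # covers Uyghur/Kazakh Arabic script
    "mongolian": [(0x1800, 0x18AF)],                 # traditional Mongolian script
}

def script_ratio(line, script):
    # Fraction of non-space characters falling in the script's ranges.
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES[script])
               for ch in chars)
    return hits / len(chars)

def filter_corpus(lines, script, threshold=0.8):
    # Keep only lines dominated by the target script.
    return [l for l in lines if script_ratio(l, script) >= threshold]

mixed = ["བོད་ཡིག་གི་ཚིག", "This is English contamination.", "བོད་ and English"]
print(filter_corpus(mixed, "tibetan"))  # only the first line survives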


The Download: China's chiplets, and OpenAI's DALL-E 3 watermarking

MIT Technology Review

Uyghurs outside China are traumatized. Now they're starting to talk about it The Uyghur diaspora have been forced to watch from afar as their loved ones disappear and a way of life is erased. The trauma has sparked a mental health crisis that leaders in the diaspora say is all too apparent. Many are reluctant to seek help, leaving the community's needs both underassessed and unmet. But a small group of outspoken Uyghurs is trying to change that.


2024 candidate Suarez faceplants in radio interview: 'What is a Uyghur?'

FOX News

Republican presidential candidate Francis Suarez appeared to admit during a Tuesday morning radio interview about national security that he does not know what a Uyghur is. The admission from Suarez came during an appearance on The Hugh Hewitt Show, where Hewitt asked Suarez, "Will you be talking about the Uyghurs in your campaign?" "The what," Suarez, the current mayor of Miami, responded. "What's a Uyghur," Suarez inquired further. Moving on from the question due to Suarez's inability to identify what a Uyghur is, Hewitt told the mayor, "You've got to get smart on that."


AI program flags Chinese products allegedly linked to Uyghur forced labor: 'Not coincidence, it's a strategy'

FOX News

Tech firm Ultra has developed an artificial intelligence-powered tool it believes has helped analysts identify products coming from China through the platform Temu that were created using forced labor, possibly from the Uyghur population. "We're looking at Temu from the perspective of the Forced Labor Prevention Act," Ultra founder and CEO Ram Ben Tzion told Fox News Digital. "How many things that we don't want are coming into the country using this method, right? The good cases are counterfeit. The worst cases are poor quality." "I'm quite confident that illicit elements can find themselves going through this platform into the market, so it's time to demand accountability," he added. Ben Tzion's company created the program Publican, which pulls in huge amounts of shipping data to analyze and look for patterns and red flags for any products ...