Uyghur


CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

Xu, Guixian, Su, Zeli, Zhang, Ziyin, Liu, Jianing, Han, Xu, Zhang, Ting, Dong, Yushuang

arXiv.org Artificial Intelligence

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks such as headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.


CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE (Chinese, Uyghur, Tibetan, English), a dataset consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validated that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for the Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
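
As a toy illustration of the parallelism contrast studied here, the following Python sketch formats the same sentence pairs either as aligned translation examples or as shuffled monolingual streams; the pair template and language labels are assumptions for illustration, not CUTE's actual layout.

import random

def parallel_examples(pairs, src_lang="Chinese", tgt_lang="Uyghur"):
    # One training example per aligned sentence pair.
    return [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in pairs]

def nonparallel_examples(pairs, seed=0):
    # The same sentences, shuffled per language so the alignment is lost.
    src_side = [src for src, _ in pairs]
    tgt_side = [tgt for _, tgt in pairs]
    rng = random.Random(seed)
    rng.shuffle(src_side)
    rng.shuffle(tgt_side)
    return src_side + tgt_side

pairs = [("你好。", "ياخشىمۇسىز؟"), ("谢谢。", "رەھمەت.")]
print(parallel_examples(pairs)[0])   # one aligned example
print(nonparallel_examples(pairs))   # unaligned monolingual streams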


Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages

Zhuang, Wenhao, Sun, Yuan, Zhao, Xiaobing

arXiv.org Artificial Intelligence

As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to extend effectively to low-resource languages, particularly those using non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, no comprehensive framework currently exists for integrating transliteration into LLM training and deployment. Taking a pragmatic approach, this paper combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: reduces storage requirements for low-resource language content, achieving up to a 50% reduction in file size and a 50-80% reduction in token count. 2) Accuracy: guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: the framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model's capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at https://github.com/CMLI-NLP/HuffmanTranslit.
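
To make the core idea concrete, here is a minimal Python sketch of reversible Huffman-based transliteration. It is a simplified illustration under assumed choices (a lowercase Latin target alphabet, a per-text code table), not the authors' released framework: frequent source characters receive short Latin codewords, and decoding is lossless because Huffman codes are prefix-free.

import heapq
from collections import Counter
from itertools import count

LATIN = "abcdefghijklmnopqrstuvwxyz"  # assumed target alphabet

def build_codes(text, arity=len(LATIN)):
    # Huffman coding over a 26-ary (Latin-letter) alphabet: frequent
    # source characters receive shorter Latin codewords.
    freqs = Counter(text)
    tiebreak = count()  # unique ints so the heap never compares payloads
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    # Pad with zero-frequency dummies so every merge can take `arity` nodes.
    while len(heap) > 1 and (len(heap) - 1) % (arity - 1) != 0:
        heapq.heappush(heap, (0, next(tiebreak), None))
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(arity)]
        total = sum(f for f, _, _ in children)
        heapq.heappush(heap, (total, next(tiebreak), children))
    codes = {}
    def walk(node, prefix):
        _, _, payload = node
        if isinstance(payload, list):          # internal node
            for i, child in enumerate(payload):
                walk(child, prefix + LATIN[i])
        elif payload is not None:              # real character leaf
            codes[payload] = prefix or LATIN[0]
    walk(heap[0], "")
    return codes

def encode(text, codes):
    return "".join(codes[ch] for ch in text)

def decode(latin, codes):
    # Prefix-free codes make left-to-right decoding unambiguous.
    inverse = {v: k for k, v in codes.items()}
    out, buf = [], ""
    for ch in latin:
        buf += ch
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

sample = "ئۇيغۇرچە تېكىست"  # Uyghur (Arabic-script) sample text
codes = build_codes(sample)
translit = encode(sample, codes)
assert decode(translit, codes) == sample  # 100% lossless round-trip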


TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages

Isbarov, Jafar, Akhundjanova, Arofat, Hajili, Mammad, Huseynova, Kavsar, Gaynullin, Dmitry, Rzayev, Anar, Tursun, Osman, Saetov, Ilshat, Kharisov, Rinat, Belginova, Saule, Kenbayeva, Ariana, Alisheva, Amina, Turdubaeva, Aizirek, Köksal, Abdullatif, Rustamov, Samir, Ataman, Duygu

arXiv.org Artificial Intelligence

Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high-quality native languages is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts have focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of native-language MMLU benchmarks, especially in the under-represented Turkic language family with its distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic-language MMLU. TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school-level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.
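
For readers unfamiliar with how such benchmarks are scored, the following Python sketch shows a generic MMLU-style multiple-choice evaluation loop; ask_model and the item fields are hypothetical placeholders, not TUMLU's actual schema or harness.

CHOICES = "ABCD"

def format_prompt(item):
    options = "\n".join(f"{CHOICES[i]}. {opt}" for i, opt in enumerate(item["options"]))
    return (f"Question ({item['subject']}, {item['language']}):\n"
            f"{item['question']}\n{options}\nAnswer with a single letter:")

def evaluate(items, ask_model):
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item))
        # Take the first A-D letter in the reply as the predicted choice.
        pred = next((c for c in reply if c in CHOICES), None)
        correct += (pred == item["answer"])
    return correct / len(items)

demo = [{"subject": "history", "language": "Uzbek",
         "question": "Which city was the capital of the Timurid Empire?",
         "options": ["Samarkand", "Bukhara", "Tashkent", "Khiva"],
         "answer": "A"}]
print(evaluate(demo, lambda prompt: "A"))  # mock model; prints 1.0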


Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Su, Zeli, Zhang, Ziyin, Xu, Guixian, Liu, Jianing, Han, Xu, Zhang, Ting, Dong, Yushuang

arXiv.org Artificial Intelligence

While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, leaving many of the world's languages without text generation models. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM and demonstrate its superior performance on various downstream tasks, even when compared with much larger models.
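
The weight-reuse idea can be sketched with the Hugging Face warm-start pattern (Rothe et al., 2020), in which a decoder is initialized from the same encoder checkpoint. This is a minimal sketch of the general principle only, not the authors' XLM-SWCM implementation.

from transformers import EncoderDecoderModel, AutoTokenizer

# Both encoder and decoder start from the same multilingual encoder
# checkpoint, so the decoder inherits the encoder's learned semantic
# space; the cross-attention layers are the only randomly initialized
# parameters.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "xlm-roberta-base", "xlm-roberta-base"
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Required generation settings for the warm-started seq2seq model.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# From here one would fine-tune on generation data in the target
# low-resource languages with a standard seq2seq training loop.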


Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

Xu, Tianyi, Huang, Kaixun, Guo, Pengcheng, Zhou, Yu, Huang, Longtao, Xue, Hui, Xie, Lei

arXiv.org Artificial Intelligence

Pre-trained multilingual speech foundation models, like Whisper, have shown impressive performance across different languages. However, adapting these models to new or specific languages is computationally expensive and faces catastrophic forgetting problems. Addressing these issues, our study investigates strategies to enhance the model on new languages in the absence of original training data, while also preserving the established performance on the original languages. Specifically, we first compare various LoRA-based methods to assess their vulnerability to forgetting. To mitigate this issue, we propose leveraging the LoRA parameters from the original model for approximate orthogonal gradient descent on the new samples. We also introduce a learnable rank coefficient to allocate trainable parameters for more efficient training. Our experiments with a Chinese Whisper model (for Uyghur and Tibetan) yield better results with a more compact parameter set.
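
As background, the standard LoRA setup that such methods build on can be reproduced in a few lines with the PEFT library; the paper's orthogonal-gradient and learnable-rank extensions are not shown here.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Low-rank adapters on the attention projections; the rank r controls
# how many trainable parameters are added per layer.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable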


MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Zhang, Chen, Tao, Mingxu, Huang, Quzhe, Lin, Jiuheng, Chen, Zhibin, Feng, Yansong

arXiv.org Artificial Intelligence

Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we present MC$^2$, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind so far. MC$^2$ includes four underrepresented languages: Tibetan, Uyghur, Kazakh, and Mongolian. Notably, we focus on the less common writing systems of Kazakh and Mongolian, i.e., the Kazakh Arabic script and the traditional Mongolian script, respectively, which have long been neglected in previous corpus construction efforts. Recognizing the prevalence of language contamination within existing corpora, we adopt a quality-centric solution for collecting MC$^2$, prioritizing accuracy while enhancing diversity. Furthermore, we underscore the importance of attending to the multiplicity of writing systems, which is closely related to the cultural awareness of the resulting models. The MC$^2$ corpus and related models are made public to the community.
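
In the spirit of this quality-centric filtering (the project's actual pipeline is more involved), a simple script-ratio filter over Unicode block ranges can drop lines contaminated by other writing systems; the ranges shown are standard Unicode blocks, while the threshold is an assumption.

SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],  # covers Uyghur/Kazakh Arabic script
    "mongolian": [(0x1800, 0x18AF)],                 # traditional Mongolian script
}

def script_ratio(line, script):
    # Fraction of non-space characters falling in the script's ranges.
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES[script])
               for ch in chars)
    return hits / len(chars)

def filter_corpus(lines, script, threshold=0.8):
    # Keep only lines dominated by the target script.
    return [l for l in lines if script_ratio(l, script) >= threshold]

mixed = ["བོད་ཡིག་གི་ཚིག", "This is English contamination.", "བོད་ and English"]
print(filter_corpus(mixed, "tibetan"))  # only the first line survives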


The Download: China's chiplets, and OpenAI's DALL-E 3 watermarking

MIT Technology Review

Uyghurs outside China are traumatized. Now they're starting to talk about it The Uyghur diaspora have been forced to watch from afar as their loved ones disappear and a way of life is erased. The trauma has sparked a mental health crisis that leaders in the diaspora say is all too apparent. Many are reluctant to seek help, leaving the community's needs both underassessed and unmet. But a small group of outspoken Uyghurs is trying to change that.


2024 candidate Suarez faceplants in radio interview: 'What is a Uyghur?'

FOX News

Republican presidential candidate Francis Suarez appeared to admit during a Tuesday morning radio interview about national security that he does not know what a Uyghur is. The admission from Suarez came during an appearance on The Hugh Hewitt Show, where Hewitt asked Suarez, "Will you be talking about the Uyghurs in your campaign?" "The what," Suarez, the current mayor of Miami, responded. "What's a Uyghur," Suarez inquired further. Moving on from the question due to Suarez's inability to identify what a Uyghur is, Hewitt told the mayor, "You've got to get smart on that."


AI program flags Chinese products allegedly linked to Uyghur forced labor: 'Not coincidence, it's a strategy'

FOX News

Tech firm Ultra has developed an artificial intelligence-powered tool it believes has helped analysts identify products coming from China through the platform Temu that were created using forced labor, possibly from the Uyghur population. "We're looking at Temu from the perspective of the Forced Labor Prevention Act," Ultra founder and CEO Ram Ben Tzion told Fox News Digital. "How many things that we don't want are coming into the country using this method, right? The good cases are counterfeit. The worst cases are poor quality." "I'm quite confident that illicit elements can find themselves going through this platform into the market, so it's time to demand accountability," he added. Ben Tzion's company created the program Publican, which pulls in huge amounts of shipping data to analyze and look for patterns and red flags for any products ...