AITopics | manchu

Collaborating Authors

manchu

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries

Sakajo, Haruki, Ide, Yusuke, Vasselli, Justin, Sakai, Yusuke, Tian, Yingtao, Kamigaito, Hidetaka, Watanabe, Taro

arXiv.org Artificial IntelligenceJun-3-2025

Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2506.01535

Country:

North America > United States (1.00)
Europe (1.00)
Asia (0.93)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.88)

Add feedback

Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu

Pei, Renhao, Liu, Yihong, Lin, Peiqin, Yvon, François, Schütze, Hinrich

arXiv.org Artificial IntelligenceFeb-17-2025

In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap the conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2502.11862

Country:

Europe (1.00)
Asia (1.00)
North America > United States > California (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

ManWav: The First Manchu ASR Model

Seo, Jean, Kang, Minha, Byun, Sungjoo, Lee, Sangah

arXiv.org Artificial IntelligenceJun-19-2024

This study addresses the widening gap in Automatic Speech Recognition (ASR) research between high resource and extremely low resource languages, with a particular focus on Manchu, a critically endangered language. Manchu exemplifies the challenges faced by marginalized linguistic communities in accessing state-of-the-art technologies. In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. The results of the first Manchu ASR is promising, especially when trained with our augmented data. Wav2Vec2-XLSR-53 fine-tuned with augmented data demonstrates a 0.02 drop in CER and 0.13 drop in WER compared to the same base model fine-tuned with original data.

manchu, manwav, recognition, (15 more...)

arXiv.org Artificial Intelligence

2406.13502

Country:

Asia > South Korea > Seoul > Seoul (0.05)
Asia > China > Heilongjiang Province (0.05)
North America (0.04)
Asia > Japan > Kyūshū & Okinawa > Kyūshū > Miyazaki Prefecture > Miyazaki (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.57)

Add feedback

Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions

Zhang, Kexun, Choi, Yee Man, Song, Zhenqiao, He, Taiqi, Wang, William Yang, Li, Lei

arXiv.org Artificial IntelligenceFeb-27-2024

How can large language models (LLMs) process and translate endangered languages? Many languages lack a large corpus to train a decent LLM; therefore existing LLMs rarely perform well in unseen, endangered languages. On the contrary, we observe that 2000 endangered languages, though without a large corpus, have a grammar book or a dictionary. We propose LINGOLLM, a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training. Our key insight is to demonstrate linguistic knowledge of an unseen language in an LLM's prompt, including a dictionary, a grammar book, and morphologically analyzed input text. We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages. Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions. Our findings demonstrate the tremendous value of linguistic knowledge in the age of LLMs for endangered languages. Our data, code, and model generations can be found at https://github.com/LLiLab/llm4endangeredlang.

ingo llm, llm, translation, (14 more...)

arXiv.org Artificial Intelligence

2402.18025

Country:

Africa > The Gambia (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > Ontario > Toronto (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

Seo, Jean, Byun, Sungjoo, Kang, Minha, Lee, Sangah

arXiv.org Artificial IntelligenceJan-12-2024

The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.

machine translation, manchu language, translation, (12 more...)

arXiv.org Artificial Intelligence

2311.17492

Country:

Asia > South Korea > Seoul > Seoul (0.06)
Asia > Mongolia (0.05)
Asia > East Asia (0.05)
(9 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback