quechua
Quechua Speech Datasets in Common Voice: The Case of Puno Quechua
Huaman, Elwin, Huaman, Wendi, Huaman, Jorge Luis, Quispe, Ninfa
Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86\% validated), with Puno Quechua contributing 12 hours (77\% validated), highlighting the Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.
- South America > Peru > Puno Department > Puno Province > Puno (0.91)
- South America > Peru > Lima Department > Lima Province > Lima (0.05)
- South America > Ecuador (0.04)
- (6 more...)
Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries
Capdevila, Martin, Turek, Esteban Villa, Fernandez, Ellen Karina Chumbe, Galvez, Luis Felipe Polo, Marroquin, Andrea, Quesada, Rebeca Vargas, Crew, Johanna, Galarraga, Nicole Vallejo, Rodriguez, Christopher, Gutierrez, Diego, Datla, Radhi
Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.
- North America > Central America (0.38)
- South America > Peru (0.06)
- South America > Ecuador (0.06)
- (38 more...)
Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning
Krasner, Nathaniel, Lanuzo, Nicholas, Anastasopoulos, Antonios
Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.
- North America > United States (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- North America > Canada (0.04)
- (6 more...)
Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
Ticona, Belu, Carranza, Fernando, Cotik, Viviana
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tup\'i-Guaran\'i, Guaycur\'u, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.
- South America > Paraguay (0.15)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.06)
- South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.05)
- (26 more...)
Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages
Shahid, Farhana, Elswah, Mona, Vashistha, Aditya
Most social media users come from non-English speaking countries in the Global South. Despite the widespread prevalence of harmful content in these regions, current moderation systems repeatedly struggle in low-resource languages spoken there. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South. These are: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies' restrictions on researchers' access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, common preprocessing techniques and language models, predominantly designed for data-rich English, fail to account for the linguistic complexity of low-resource languages. This leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.
- Africa > East Africa (0.24)
- Africa > North Africa (0.24)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- (38 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Media (1.00)
- Law (1.00)
- Information Technology > Services (1.00)
- (3 more...)
QueEn: A Large Language Model for Quechua-English Translation
Chen, Junhao, Shu, Peng, Li, Yiwei, Zhao, Huaqin, Jiang, Hanqi, Pan, Yi, Zhou, Yifan, Liu, Zhengliang, Howe, Lewis C, Liu, Tianming
Recent studies show that large language models (LLMs) are powerful tools for working with natural language, bringing advances in many areas of computational linguistics. However, these models face challenges when applied to low-resource languages due to limited training data and difficulty in understanding cultural nuances. In this paper, we propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques. Our method leverages external linguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for efficient model adaptation. Experimental results show that our approach substantially exceeds baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models. The integration of RAG with fine-tuning allows our system to address the challenges of low-resource language translation while maintaining computational efficiency. This work contributes to the broader goal of preserving endangered languages through advanced language technologies.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Peru (0.04)
- South America > Ecuador (0.04)
- (2 more...)
Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem
This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of information retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of prompt type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world's 7,000+ languages and their speakers.
- North America > Canada > Ontario > Toronto (0.05)
- North America > United States > Ohio (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- (13 more...)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information
Taguchi, Chihiro, Saransig, Jefferson, Velásquez, Dayana, Chiang, David
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available.
- South America > Bolivia (0.04)
- South America > Peru > Puno Department > Puno Province > Puno (0.04)
- South America > Ecuador > Pichincha Province > Quito (0.04)
- (8 more...)
ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana
Romero, Monica, Gomez, Sandra, Torre, Iván G.
Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of America. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods that resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters by a Bayesian search on the performance of the language models, specifically focusing on the variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparametric configurations to the performances of our best models. Importantly, our results show that freeze fine-tuning updates and dropout rate are more vital parameters than the total number of epochs of lr. Additionally, we liberate our best models -- with no other ASR model reported until now for two Wa'ikhana and Kotiria -- and the many experiments performed to pave the way to other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous and acknowledging the complexities involved in this important endeavour.
- South America > Brazil (0.05)
- Europe > Spain > Galicia > Madrid (0.05)
- South America > Colombia (0.05)
- (15 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)
Evaluating Self-Supervised Speech Representations for Indigenous American Languages
Chen, Chih-Chen, Chen, William, Zevallos, Rodolfo, Ortega, John E.
The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American Language. We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.
- South America > Brazil (0.05)
- North America > Canada > Ontario > Toronto (0.05)
- South America > Bolivia (0.05)
- (18 more...)