AITopics | quechua

Collaborating Authors

quechua

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

Huaman, Elwin, Huaman, Wendi, Huaman, Jorge Luis, Quispe, Ninfa

arXiv.org Artificial IntelligenceOct-17-2025

Under-resourced languages, such as Quechuas, face data and resource scarcity, hindering their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster an open and community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both reading and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86\% validated), with Puno Quechua contributing 12 hours (77\% validated), highlighting the Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.

artificial intelligence, quechua, speech recognition, (13 more...)

arXiv.org Artificial Intelligence

2510.13871

Country: South America > Peru > Puno Department > Puno Province > Puno (0.91)

Genre: Research Report > New Finding (0.54)

Industry: Government (0.68)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)

Add feedback

Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries

Capdevila, Martin, Turek, Esteban Villa, Fernandez, Ellen Karina Chumbe, Galvez, Luis Felipe Polo, Marroquin, Andrea, Quesada, Rebeca Vargas, Crew, Johanna, Galarraga, Nicole Vallejo, Rodriguez, Christopher, Gutierrez, Diego, Datla, Radhi

arXiv.org Artificial IntelligenceAug-20-2025

Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.

artificial intelligence, crossing border, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.09902

Country:

South America (1.00)
North America > United States > Florida (0.28)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

Cross-Lingual Representation Alignment Through Contrastive Image-Caption Tuning

Krasner, Nathaniel, Lanuzo, Nicholas, Anastasopoulos, Antonios

arXiv.org Artificial IntelligenceMay-21-2025

Multilingual alignment of sentence representations has mostly required bitexts to bridge the gap between languages. We investigate whether visual information can bridge this gap instead. Image caption datasets are very easy to create without requiring multilingual expertise, so this offers a more efficient alternative for low-resource languages. We find that multilingual image-caption alignment can implicitly align the text representations between languages, languages unseen by the encoder in pretraining can be incorporated into this alignment post-hoc, and these aligned representations are usable for cross-lingual Natural Language Understanding (NLU) and bitext retrieval.

artificial intelligence, natural language, text processing, (18 more...)

arXiv.org Artificial Intelligence

2505.13628

Country:

Europe (1.00)
Asia (0.68)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.35)

Add feedback

Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources

Ticona, Belu, Carranza, Fernando, Cotik, Viviana

arXiv.org Artificial IntelligenceFeb-7-2025

Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tup\'i-Guaran\'i, Guaycur\'u, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.

computational linguistic, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.09943

Country:

South America > Paraguay (0.15)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.06)
South America > Argentina > Pampas > Buenos Aires F.D. > Buenos Aires (0.05)
(26 more...)

Genre: Overview (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)

Add feedback

Think Outside the Data: Colonial Biases and Systemic Issues in Automated Moderation Pipelines for Low-Resource Languages

Shahid, Farhana, Elswah, Mona, Vashistha, Aditya

arXiv.org Artificial IntelligenceJan-23-2025

Most social media users come from non-English speaking countries in the Global South. Despite the widespread prevalence of harmful content in these regions, current moderation systems repeatedly struggle in low-resource languages spoken there. In this work, we examine the challenges AI researchers and practitioners face when building moderation tools for low-resource languages. We conducted semi-structured interviews with 22 AI researchers and practitioners specializing in automatic detection of harmful content in four diverse low-resource languages from the Global South. These are: Tamil from South Asia, Swahili from East Africa, Maghrebi Arabic from North Africa, and Quechua from South America. Our findings reveal that social media companies' restrictions on researchers' access to data exacerbate the historical marginalization of these languages, which have long lacked datasets for studying online harms. Moreover, common preprocessing techniques and language models, predominantly designed for data-rich English, fail to account for the linguistic complexity of low-resource languages. This leads to critical errors when moderating content in Tamil, Swahili, Arabic, and Quechua, which are morphologically richer than English. Based on our findings, we establish that the precarities in current moderation pipelines are rooted in deep systemic inequities and continue to reinforce historical power imbalances. We conclude by discussing multi-stakeholder approaches to improve moderation for low-resource languages.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2501.13836

Country:

Africa > East Africa (0.24)
Africa > North Africa (0.24)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
(38 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Industry:

Media (1.00)
Law (1.00)
Information Technology > Services (1.00)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

QueEn: A Large Language Model for Quechua-English Translation

Chen, Junhao, Shu, Peng, Li, Yiwei, Zhao, Huaqin, Jiang, Hanqi, Pan, Yi, Zhou, Yifan, Liu, Zhengliang, Howe, Lewis C, Liu, Tianming

arXiv.org Artificial IntelligenceDec-6-2024

Recent studies show that large language models (LLMs) are powerful tools for working with natural language, bringing advances in many areas of computational linguistics. However, these models face challenges when applied to low-resource languages due to limited training data and difficulty in understanding cultural nuances. In this paper, we propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques. Our method leverages external linguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for efficient model adaptation. Experimental results show that our approach substantially exceeds baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models. The integration of RAG with fine-tuning allows our system to address the challenges of low-resource language translation while maintaining computational efficiency. This work contributes to the broader goal of preserving endangered languages through advanced language technologies.

large language model, machine learning, translation, (16 more...)

arXiv.org Artificial Intelligence

2412.05184

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Peru (0.04)
South America > Ecuador (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding are Both the Problem

Court, Sara, Elsner, Micha

arXiv.org Artificial IntelligenceJun-21-2024

This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of information retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of prompt type, retrieval method, model type, and language-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world's 7,000+ languages and their speakers.

computational linguistic, information, translation, (15 more...)

arXiv.org Artificial Intelligence

2406.15625

Country:

North America > Canada > Ontario > Toronto (0.05)
North America > United States > Ohio (0.04)
Europe > Italy > Tuscany > Florence (0.04)
(13 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information

Taguchi, Chihiro, Saransig, Jefferson, Velásquez, Dayana, Chiang, David

arXiv.org Artificial IntelligenceApr-23-2024

This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available.

dataset, kichwa, transcription, (15 more...)

arXiv.org Artificial Intelligence

2404.15501

Country:

South America > Bolivia (0.04)
South America > Peru > Puno Department > Puno Province > Puno (0.04)
South America > Ecuador > Pichincha Province > Quito (0.04)
(8 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana

Romero, Monica, Gomez, Sandra, Torre, Iván G.

arXiv.org Artificial IntelligenceApr-12-2024

Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of America. The Second AmericasNLP Competition Track 1 of NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods that resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters by a Bayesian search on the performance of the language models, specifically focusing on the variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparametric configurations to the performances of our best models. Importantly, our results show that freeze fine-tuning updates and dropout rate are more vital parameters than the total number of epochs of lr. Additionally, we liberate our best models -- with no other ASR model reported until now for two Wa'ikhana and Kotiria -- and the many experiments performed to pave the way to other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous and acknowledging the complexities involved in this important endeavour.

hyperparameter, indigenous language, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2404.08368

Country:

South America > Brazil (0.05)
Europe > Spain > Galicia > Madrid (0.05)
South America > Colombia (0.05)
(14 more...)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.94)

Add feedback

Evaluating Self-Supervised Speech Representations for Indigenous American Languages

Chen, Chih-Chen, Chen, William, Zevallos, Rodolfo, Ortega, John E.

arXiv.org Artificial IntelligenceOct-8-2023

The application of self-supervision to speech representation learning has garnered significant interest in recent years, due to its scalability to large amounts of unlabeled data. However, much progress, both in terms of pre-training and downstream evaluation, has remained concentrated in monolingual models that only consider English. Few models consider other languages, and even fewer consider indigenous ones. In our submission to the New Language Track of the ASRU 2023 ML-SUPERB Challenge, we present an ASR corpus for Quechua, an indigenous South American Language. We benchmark the efficacy of large SSL models on Quechua, along with 6 other indigenous languages such as Guarani and Bribri, on low-resource ASR. Our results show surprisingly strong performance by state-of-the-art SSL models, showing the potential generalizability of large-scale models to real-world data.

computational linguistic, indigenous language, quechua, (13 more...)

arXiv.org Artificial Intelligence

2310.03639

Country:

South America > Brazil (0.05)
North America > Canada > Ontario > Toronto (0.05)
South America > Bolivia (0.05)
(18 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.70)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.49)

Add feedback