AITopics | language community

af38fb8e90d586f209235c94119ba193-Paper-Conference.pdf

Neural Information Processing Systems, Feb-16-2026, 13:51:54 GMT

large language model, machine learning, med-unic, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Ohio (0.04)
Europe > Spain (0.04)
(2 more...)

Genre: Research Report (0.46)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.96)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.95)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


Fit for our purpose, not yours: Benchmark for a low-resource, Indigenous language

Neural Information Processing Systems, Feb-11-2026, 04:54:23 GMT

The datasets contain numerous grammatical and orthographic errors, poor pronunciation, limited vocabulary, and the content lacks cultural relevance to the language community.

benchmark, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > Indonesia > Bali (0.04)
South America > Peru (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Neural Information Processing Systems, Dec-26-2025, 13:47:39 GMT

The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language communities. Nevertheless, the main challenge stems from the complexity of integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. Therefore, one crucial aspect to consider is the presence of community bias caused by different languages. This paper presents a novel framework named Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC), designed to integrate multi-modal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose Cross-lingual Text Alignment Regularization (CTR) to explicitly unify cross-lingual semantic representations of medical reports originating from diverse language communities.

med-unic, unifying cross-lingual medical vision-language pre-training, (9 more...)

Neural Information Processing Systems

Industry: Health & Medicine (0.41)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.64)
Information Technology > Artificial Intelligence > Natural Language (0.59)


Fit for our purpose, not yours: Benchmark for a low-resource, Indigenous language

Neural Information Processing Systems, Dec-24-2025, 23:23:08 GMT

Influential and popular benchmarks in AI are largely irrelevant to developing NLP tools for low-resource, Indigenous languages. With the primary goal of measuring the performance of general-purpose AI systems, these benchmarks fail to give due consideration and care to individual language communities, especially low-resource languages. The datasets contain numerous grammatical and orthographic errors, poor pronunciation, limited vocabulary, and the content lacks cultural relevance to the language community. To overcome the issues with these benchmarks, we have created a dataset for te reo Māori (the Indigenous language of Aotearoa/New Zealand) to pursue NLP tools that are 'fit-for-our-purpose'. This paper demonstrates how low-resourced, Indigenous languages can develop tailored, high-quality benchmarks that: i. consider the impact of colonisation on their language; ii.

artificial intelligence, indigenous language, name change, (4 more...)

Neural Information Processing Systems

Country: Oceania > New Zealand (0.28)

Technology: Information Technology > Artificial Intelligence (1.00)


No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Choi, Dasol, Park, Woomyoung, Song, Youngsook

arXiv.org Artificial Intelligence, Oct-16-2025

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing, ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.04329

Country:

North America > United States > Minnesota (0.28)
Asia > East Asia (0.24)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)


37f4dd559db9cd0a42ce72987d27ab27-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-9-2025, 23:24:39 GMT

benchmark, dataset, language community, (17 more...)

Neural Information Processing Systems

Country:

Asia > Indonesia > Bali (0.04)
South America > Peru (0.04)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
(3 more...)

Genre: Research Report > Experimental Study (0.93)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)


Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Zhongwei Wan

Neural Information Processing Systems, Oct-9-2025, 04:50:56 GMT

The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP).

large language model, machine learning, med-unic, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Hong Kong (0.04)
North America > United States > Ohio (0.04)
Europe > Spain (0.04)
(2 more...)

Genre: Research Report (0.46)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.96)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Akera, Benjamin, Nafula, Evelyn, Walukagga, Patrick, Yiga, Gilbert, Quinn, John, Mwebaze, Ernest

arXiv.org Artificial Intelligence, Oct-9-2025

The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release the accompanying models; see https://github.com/SunbirdAI/kinyarwanda-whisper-eval

data quality, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2510.07221

Country: Africa (0.15)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


Fit for our purpose, not yours: Benchmark for a low-resource, Indigenous language

Neural Information Processing Systems, May-26-2025, 21:22:01 GMT

Influential and popular benchmarks in AI are largely irrelevant to developing NLP tools for low-resource, Indigenous languages. With the primary goal of measuring the performance of general-purpose AI systems, these benchmarks fail to give due consideration and care to individual language communities, especially low-resource languages. The datasets contain numerous grammatical and orthographic errors, poor pronunciation, limited vocabulary, and the content lacks cultural relevance to the language community. To overcome the issues with these benchmarks, we have created a dataset for te reo Māori (the Indigenous language of Aotearoa/New Zealand) to pursue NLP tools that are 'fit-for-our-purpose'. This paper demonstrates how low-resourced, Indigenous languages can develop tailored, high-quality benchmarks that: i.

artificial intelligence, indigenous language, language community, (2 more...)

Neural Information Processing Systems

Country: Oceania > New Zealand (0.30)

Technology: Information Technology > Artificial Intelligence (1.00)


Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias

Neural Information Processing Systems, Jan-19-2025, 19:30:59 GMT

The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in the combination of datasets from various language communities. Nevertheless, the main challenge stems from the complexity of integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. Therefore, one crucial aspect to consider is the presence of community bias caused by different languages. This paper presents a novel framework named Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC), designed to integrate multi-modal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose Cross-lingual Text Alignment Regularization (CTR) to explicitly unify cross-lingual semantic representations of medical reports originating from diverse language communities. Furthermore, it ensures that the cross-lingual representation is not biased toward any specific language community. Med-UniC

med-unic, unifying cross-lingual medical vision-language pre-training, (7 more...)

Neural Information Processing Systems

Industry: Health & Medicine (0.44)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.61)

