AITopics | resource language

Collaborating Authors

resource language

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

d324a0cc02881779dcda44a675fdcaaa-Paper.pdf

Neural Information Processing SystemsFeb-11-2026, 07:31:45 GMT

arxiv preprint arxiv, multilingual model, resource language, (14 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Asia > Indonesia > Bali (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Robust Optimization for Multilingual Translation with Imbalanced Data

Neural Information Processing SystemsDec-24-2025, 23:12:11 GMT

Multilingual models are parameter-efficient and especially effective in improving low-resource languages by leveraging crosslingual transfer. Despite recent advance in massive multilingual translation with ever-growing model and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, poses optimization tension between high resource and low resource languages where the found multilingual solution is often sub-optimal for low resources. We show that common training method which upsamples low resources can not robustly optimize population loss with risks of either underfitting high resource languages or overfitting low resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and as a result demonstrated consistent gains on low resources ($+0.8$ to $+2.2$ BLEU) without hurting high resources. In addition, CATS is robust to overparameterization and large batch size training, making it a promising training method for massive multilingual models that truly improve low resource languages.

multilingual translation, name change, robust optimization, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.76)

Add feedback

MegaChat: A Synthetic Persian Q&A Dataset for High-Quality Sales Chatbot Evaluation

Rahmani, Mahdi, Saffari, AmirHossein, Rahmani, Reyhane

arXiv.org Artificial IntelligenceDec-1-2025

Small and medium - sized enterprises (SMEs) in Iran increasingly leverage Telegram for sales, where real - time engagement is essential for conversion. However, developing AI - driven chatbots for this purpose requires large, high - quality question - and - answer (Q&A) datasets, which are typically expensive and resource - intensive to produce, especially for low - resource languages like Persian. In this paper, we introduce MegaChat, the first fully synthetic Persian Q&A dataset designed to evaluate intelligent sales ch atbots in Telegram - based e - commerce. We propose a novel, automated multi - agent architecture that generates persona - aware Q&A pairs by collecting data from active Telegram shopping channels. The system employs specialized agents for question generation, validation, and refinement, ensuring the production of realistic and diverse conversational data. To evaluate answer generation, we compare three classic retrieval - augmented generation (RAG) models with our advanced agentic system, which features multi - query retrieval, reranking, and persona - aligned response synthesis. Using GPT - 5.1 for evaluation across six quality dimensions, our results show that the agentic architecture outperformed traditional RAG models in 4 out of 5 diverse channels, demonstrating its ability to generate scalable, high - quality datasets without relying on expensive human annotation or complex fine - tuning. MegaChat provides SMEs with an efficient, cost - effective solution for building intelligent customer engagement systems in specialized c ommercial domains, enabling advancements in multilingual conversational AI for low - resource languages.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.23397

Country:

Asia > Middle East > Iran (0.36)
North America > United States (0.28)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Services (0.36)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

KurdSTS: The Kurdish Semantic Textual Similarity

Abdullah, Abdulhady Abas, Veisi, Hadi, Al, Hussein M.

arXiv.org Artificial IntelligenceDec-1-2025

Semantic Textual Similarity measures the degree of equivalence between the two texts and is important in many Natural Language Processing tasks. While extensive resources have been developed for high - resource languages, unfortunately, low - resource languages, for example, Kurdish, have been neglected. In this paper, the first STS dataset for K urdish has been introduced, which aims to alleviate this gap. This dataset contains 10,000 formal and informal sentence pairs annotated for similarity. To this end, aft er benchmarking several models, such as Sentence Bidirectional Encoder Representations from Transformers (Sentence - BERT) and multilingual Bidirectional Encoder Representations from Transformers (multilingual BERT), among others, which achieved promising results while also showcasing the difficulties presented by the distinctive nature of Kurdish. This work paves the way for future studies in Kurdish semantic research and Natural Language Processing in general for other low - resource languages.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.02336

Country: Asia > Middle East > Iraq > Kurdistan Region (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Add feedback

Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models

Sahili, Zahraa Al, Patras, Ioannis, Purver, Matthew

arXiv.org Artificial IntelligenceNov-20-2025

Multilingual vision-language models (VLMs) promise universal image-text retrieval, yet their social biases remain underexplored. We perform the first systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in resource availability and morphological gender marking. Using balanced subsets of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify race and gender bias and measure stereotype amplification. Contrary to the intuition that multilinguality mitigates bias, every model exhibits stronger gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest biases precisely in the low-resource languages it targets, while the shared encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into gender-neutral languages; loosely coupled encoders largely avoid this leakage. Although SigLIP-2 reduces agency and communion skews, it inherits -- and in caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime associations. Highly gendered languages consistently magnify all bias types, yet gender-neutral languages remain vulnerable whenever cross-lingual weight sharing imports foreign stereotypes. Aggregated metrics thus mask language-specific hot spots, underscoring the need for fine-grained, language-aware bias evaluation in future multilingual VLM research.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2505.1416

Country:

Europe (1.00)
North America > United States (0.28)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

EmoBang: Detecting Emotion From Bengali Texts

Maruf, Abdullah Al, Golder, Aditi, Jiyad, Zakaria Masud, Numan, Abdullah Al, Zaman, Tarannum Shaila

arXiv.org Artificial IntelligenceNov-11-2025

Emotion detection from text seeks to identify an individual's emotional or mental state - positive, negative, or neutral - based on linguistic cues. While significant progress has been made for English and other high-resource languages, Bengali remains underexplored despite being the world's fourth most spoken language. The lack of large, standardized datasets classifies Bengali as a low-resource language for emotion detection. Existing studies mainly employ classical machine learning models with traditional feature engineering, yielding limited performance. In this paper, we introduce a new Bengali emotion dataset annotated across eight emotion categories and propose two models for automatic emotion detection: (i) a hybrid Convolutional Recurrent Neural Network (CRNN) model (EmoBangHybrid) and (ii) an AdaBoost-Bidirectional Encoder Representations from Transformers (BERT) ensemble model (EmoBangEnsemble). Additionally, we evaluate six baseline models with five feature engineering techniques and assess zero-shot and few-shot large language models (LLMs) on the dataset. To the best of our knowledge, this is the first comprehensive benchmark for Bengali emotion detection. Experimental results show that EmoBangH and EmoBangE achieve accuracies of 92.86% and 93.69%, respectively, outperforming existing methods and establishing strong baselines for future research.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.07077

Country:

North America > United States (0.68)
Asia (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Arabic Little STT: Arabic Children Speech Recognition Dataset

Alkadri, Mouhand, Desouki, Dania, Jallad, Khloud Al

arXiv.org Artificial IntelligenceOct-28-2025

The performance of Artificial Intelligence (AI) systems fundamentally depends on high-quality training data. However, low-resource languages like Arabic suffer from severe data scarcity. Moreover, the absence of child-specific speech corpora is an essential gap that poses significant challenges. To address this gap, we present our created dataset, Arabic Little STT, a dataset of Levantine Arabic child speech recorded in classrooms, containing 355 utterances from 288 children (ages 6 - 13). We further conduct a systematic assessment of Whisper, a state-of-the-art automatic speech recognition (ASR) model, on this dataset and compare its performance with adult Arabic benchmarks. Our evaluation across eight Whisper variants reveals that even the best-performing model (Large_v3) struggles significantly, achieving a 0.66 word error rate (WER) on child speech, starkly contrasting with its sub 0.20 WER on adult datasets. These results align with other research on English speech. Results highlight the critical need for dedicated child speech benchmarks and inclusive training data in ASR development. Emphasizing that such data must be governed by strict ethical and privacy frameworks to protect sensitive child information. We hope that this study provides an initial step for future work on equitable speech technologies for Arabic-speaking children. We hope that our publicly available dataset enrich the children's demographic representation in ASR datasets.

artificial intelligence, machine learning, speech recognition, (14 more...)

arXiv.org Artificial Intelligence

2510.23319

Genre: Research Report (1.00)

Industry: Education > Educational Setting > Online (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

AI-Generated Text Detection in Low-Resource Languages: A Case Study on Urdu

Ammar, Muhammad, Hadi, Hadiya Murad, Butt, Usman Majeed

arXiv.org Artificial IntelligenceOct-21-2025

Large Language Models (LLMs) are now capable of generating text that closely resembles human writing, making them powerful tools for content creation, but this growing ability has also made it harder to tell whether a piece of text was written by a human or by a machine. This challenge becomes even more serious for languages like Urdu, where there are very few tools available to detect AI-generated text. To address this gap, we propose a novel AI-generated text detection framework tailored for the Urdu language. A balanced dataset comprising 1,800 humans authored, and 1,800 AI generated texts, sourced from models such as Gemini, GPT-4o-mini, and Kimi AI was developed. Detailed linguistic and statistical analysis was conducted, focusing on features such as character and word counts, vocabulary richness (Type Token Ratio), and N-gram patterns, with significance evaluated through t-tests and MannWhitney U tests. Three state-of-the-art multilingual transformer models such as mdeberta-v3-base, distilbert-base-multilingualcased, and xlm-roberta-base were fine-tuned on this dataset. The mDeBERTa-v3-base achieved the highest performance, with an F1-score 91.29 and accuracy of 91.26% on the test set. This research advances efforts in contesting misinformation and academic misconduct in Urdu-speaking communities and contributes to the broader development of NLP tools for low resource languages.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.16573

Country: Asia > Pakistan (0.16)

Genre: Research Report > Experimental Study (0.48)

Industry: Media > News (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Beyond WER: Probing Whisper's Sub-token Decoder Across Diverse Language Resource Levels

Liang, Siyu, Ballier, Nicolas, Levow, Gina-Anne, Wright, Richard

arXiv.org Artificial IntelligenceOct-1-2025

While large multilingual automatic speech recognition (ASR) models achieve remarkable performance, the internal mechanisms of the end-to-end pipeline, particularly concerning fairness and efficacy across languages, remain underexplored. This paper introduces a fine-grained analysis of Whisper's multilingual decoder, examining its sub-token hypotheses during transcription across languages with various resource levels. Our method traces the beam search path, capturing sub-token guesses and their associated probabilities. Results reveal that higher resource languages benefit from higher likelihood of the correct token being top-ranked, greater confidence, lower predictive entropy, and more diverse alternative candidates. Lower resource languages fare worse on these metrics, but also exhibit distinct clustering patterns in sub-token usage sometimes influenced by typology in our PCA and t-SNE analysis. This sub-token probing uncovers systematic decoding disparities masked by aggregate error rates and points towards targeted interventions to ameliorate the imbalanced development of speech technology.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.25516

Country:

North America > Mexico (0.28)
Europe > France (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish

Hu, Jinfan Frank

arXiv.org Artificial IntelligenceSep-30-2025

Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies - word-level, character-level, n-gram, and Byte Pair Encoding (BPE) - on the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed all alternatives across all tokenization strategies tested. These findings suggest that in agglutinative, low-resource contexts, preserving boundaries via word-level tokenization may yield better embedding performance than complex statistical methods. This has practical implications for developing NLP pipelines for under-resourced languages where annotated data and computing power are limited.

artificial intelligence, natural language, tokenization strategy, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ACDSA65407.2025.11165911

2509.14238

Country:

Europe (1.00)
North America > United States (0.94)
Asia > Middle East > Republic of Türkiye (0.68)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Add feedback