AITopics | dialect

dialect

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

From chirps to 'hellos': Why some birds talk like people

Popular Science, Mar-8-2026, 12:01:00 GMT

Brains, bonds, and a strange voice box help some birds mimic our speech. Budgies (short for budgerigars) are actually a specific kind of parakeet. These birds are excellent communicators. In 1995, a California parakeet earned the Guinness World Record for having the largest human vocabulary among birds.

artificial intelligence, parrot, wright, (11 more...)

Popular Science

Country:

North America > United States > California > San Francisco County > San Francisco (0.15)
Oceania > New Zealand (0.05)
Oceania > Australia > South Australia (0.05)
(7 more...)

Genre: Research Report > New Finding (0.35)

Industry: Media > Photography (0.32)

Technology: Information Technology > Artificial Intelligence (0.50)

Language Model Tokenizers Introduce Unfairness Between Languages

Neural Information Processing Systems, Feb-14-2026, 13:54:12 GMT

Recent language models have shown impressive multilingual performance, even when not explicitly trained for it. Despite this, there are concerns about the quality of their outputs across different languages. In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support.
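The length disparity described above can be illustrated with a minimal sketch. It assumes a stand-in tokenizer that emits one token per UTF-8 byte, which is not the tokenizer of any particular model, and the three sample strings are hypothetical greetings chosen only to show different scripts. The function name `tokenization_length_ratios` is likewise illustrative and does not come from the paper.

```python
from typing import Callable, Dict, List


def byte_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


def tokenization_length_ratios(
    parallel: Dict[str, str],
    baseline: str,
    tokenize: Callable[[str], List[int]] = byte_tokenize,
) -> Dict[str, float]:
    """Token count of each language's text divided by the baseline's token count."""
    base_len = len(tokenize(parallel[baseline]))
    if base_len == 0:
        raise ValueError("baseline text produced no tokens")
    return {lang: len(tokenize(text)) / base_len for lang, text in parallel.items()}


# Hypothetical parallel greetings; illustrative only, not data from the paper.
samples = {"en": "hello", "el": "γεια", "hi": "नमस्ते"}
ratios = tokenization_length_ratios(samples, baseline="en")
for lang, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{lang}: {ratio:.2f}x the baseline token count")
```

Any real tokenizer can be swapped in through the `tokenize` parameter, as long as it maps a string to a list of tokens.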

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > Haiti (0.14)
Asia > Philippines > Luzon > Ilocos Region > Province of Pangasinan (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(38 more...)

Genre: Overview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

AI language models show bias against regional German dialects

AIHub, Dec-8-2025, 10:36:34 GMT

A recent collaborative study between Johannes Gutenberg University Mainz (JGU) and the universities of Hamburg and Washington shows that AI language models are biased against regional German dialects. The results, presented at this year's Conference on Empirical Methods in Natural Language Processing (EMNLP), one of the world's leading conferences in computational linguistics, show that all tested AI systems reproduce social stereotypes. "Dialects are an essential part of cultural identity," emphasized Minh Duc Bui, a doctoral researcher in von der Wense's Natural Language Processing (NLP) group at JGU's Institute of Computer Science. "Our analyses suggest that language models associate dialects with negative traits, thereby perpetuating problematic social biases." Using linguistic databases containing orthographic and phonetic variants of German dialects, the team first translated seven regional varieties into Standard German.

artificial intelligence, dialect, natural language, (12 more...)

AIHub

Country:

Europe > Germany > Rheinland-Pfalz > Mainz (0.30)
North America > Canada (0.05)
Europe > Finland > Uusimaa > Helsinki (0.05)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.32)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.31)

Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using Language Models

Vaidya, Sairam, Böhme, Marcel, D'Antoni, Loris

arXiv.org Artificial Intelligence, Dec-8-2025

Modern extensible compiler frameworks, such as MLIR, enable rapid creation of domain-specific language dialects. This flexibility, however, makes correctness harder to ensure, as the same extensibility that accelerates development also complicates maintaining the testing infrastructure. Extensible languages require automated test generation that is both dialect-agnostic (works across dialects without manual adaptation) and dialect-effective (targets dialect-specific features to find bugs). Existing approaches typically sacrifice one of these goals by either requiring manually constructed seed corpora for each dialect, or by failing to be effective. We present a dialect-agnostic and dialect-effective grammar-based and coverage-guided fuzzing approach for extensible compilers that combines two key insights from existing work: (i) the grammars of dialects, which already encode the structural and type constraints, can often be extracted automatically from the dialect specification; and (ii) these grammars can be used in combination with pre-trained large language models to automatically generate representative and diverse seed inputs from the full dialect space without requiring any manual input or training data. These seeds can then be used to bootstrap coverage-guided fuzzers. We built this approach into a tool, Germinator. When evaluated on six MLIR projects spanning 91 dialects, Germinator-generated seeds improve line coverage by 10-120% over grammar-based baselines. We compare against grammar-based baselines because they are the only class of existing automatic seed generators that can be applied uniformly across MLIR's heterogeneous dialect ecosystem. Germinator discovers 88 previously unknown bugs (40 confirmed), including 23 in dialects with no prior automated test generators, demonstrating effective and controllable testing of low-resource dialects at scale.
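As a rough illustration of what grammar-based seed generation means, the sketch below derives random strings from a small hand-written grammar. It shows only plain random derivation, the general family the abstract calls grammar-based baselines, and not Germinator's combination of extracted grammars with language models. The grammar, the `toy.` operation names, and the function names are all hypothetical and are not taken from any MLIR dialect.

```python
import random
from typing import Dict, List

# Toy grammar (assumption): hand-written for illustration, not extracted from a dialect.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<op>": [["<result>", " = ", "<name>", " ", "<operand>", ", ", "<operand>"]],
    "<name>": [["toy.add"], ["toy.mul"]],
    "<result>": [["%r"]],
    "<operand>": [["%a"], ["%b"], ["<const>"]],
    "<const>": [["0"], ["1"], ["42"]],
}


def derive(symbol: str, rng: random.Random, depth: int = 0, max_depth: int = 16) -> str:
    """Expand a symbol by choosing one production uniformly at random."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emitted verbatim
    if depth > max_depth:
        raise ValueError("derivation exceeded max_depth")
    production = rng.choice(GRAMMAR[symbol])
    return "".join(derive(part, rng, depth + 1, max_depth) for part in production)


def generate_seeds(count: int, seed: int = 0) -> List[str]:
    """Generate `count` seed inputs; a fixed seed makes the run reproducible."""
    rng = random.Random(seed)
    return [derive("<op>", rng) for _ in range(count)]


for line in generate_seeds(5):
    print(line)
```

Seeds produced this way would then be handed to a coverage-guided fuzzer as its initial corpus.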

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2512.05887

Country:

North America > United States > California > San Diego County > San Diego (0.40)
Europe > Austria > Vienna (0.14)
North America > United States > New York > New York County > New York City (0.04)
(12 more...)

Genre: Research Report (0.83)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)

Developing an Open Conversational Speech Corpus for the Isan Language

Na-Thalang, Adisai, Wittayasakpan, Chanakan, Phatcharoen, Kritsadha, Buakaw, Supakit

arXiv.org Artificial Intelligence, Dec-5-2025

This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquialisms, spontaneous prosody, disfluencies, and frequent code-switching with Central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

artificial intelligence, natural language, thatphithakkul, (9 more...)

arXiv.org Artificial Intelligence

2511.21229

Country: Asia > Thailand (0.24)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.67)
Information Technology > Artificial Intelligence > Speech (0.46)

Computational Linguistics Meets Libyan Dialect: A Study on Dialect Identification

Essgaer, Mansour, Massud, Khamis, Mamlook, Rabia Al, Ghmaid, Najah

arXiv.org Artificial Intelligence, Dec-5-2025

This study investigates logistic regression, linear support vector machine, multinomial Naive Bayes, and Bernoulli Naive Bayes for classifying Libyan dialect utterances gathered from Twitter. The dataset used is the QADI corpus, which consists of 540,000 sentences across 18 Arabic dialects. Preprocessing challenges include handling inconsistent orthographic variations and non-standard spellings typical of the Libyan dialect. The chi-square analysis revealed that certain features, such as email mentions and emotion indicators, were not significantly associated with dialect classification and were thus excluded from further analysis. Two main experiments were conducted: (1) evaluating the significance of meta-features extracted from the corpus using the chi-square test and (2) assessing classifier performance using different word and character n-gram representations. The classification experiments showed that Multinomial Naive Bayes (MNB) achieved the highest accuracy of 85.89% and an F1-score of 0.85741 when using a (1,2) word n-gram and (1,5) character n-gram representation. In contrast, Logistic Regression and Linear SVM exhibited slightly lower performance, with maximum accuracies of 84.41% and 84.73%, respectively. Additional evaluation metrics, including log loss, Cohen's kappa, and the Matthews correlation coefficient, further supported the effectiveness of MNB in this task. The results indicate that carefully selected n-gram representations and classification models play a crucial role in improving the accuracy of Libyan dialect identification. This study provides empirical benchmarks and insights for future research in Arabic dialect NLP applications.
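To make the best-performing configuration more concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes over combined (1,2) word n-grams and (1,5) character n-grams. It is not the authors' pipeline: the class and function names are illustrative, add-one smoothing is an assumption, and the training strings are synthetic placeholder tokens rather than QADI data.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Set, Tuple


def ngram_features(
    text: str,
    word_range: Tuple[int, int] = (1, 2),
    char_range: Tuple[int, int] = (1, 5),
) -> List[str]:
    """Word and character n-grams, prefixed so the two feature spaces stay separate."""
    feats: List[str] = []
    words = text.split()
    for n in range(word_range[0], word_range[1] + 1):
        feats += ["w:" + " ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    for n in range(char_range[0], char_range[1] + 1):
        feats += ["c:" + text[i:i + n] for i in range(len(text) - n + 1)]
    return feats


class MultinomialNB:
    """Multinomial Naive Bayes with add-alpha smoothing (alpha is an assumption)."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha
        self.class_counts: Counter = Counter()
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "MultinomialNB":
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            self.feature_counts[label].update(ngram_features(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, text: str) -> Optional[str]:
        feats = [f for f in ngram_features(text) if f in self.vocab]  # drop unseen features
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[f] + self.alpha) / denom) for f in feats)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic placeholder tokens standing in for two dialect classes.
train_texts = ["zaza lolo zaza", "lolo zaza", "kiki mumu", "mumu kiki kiki"]
train_labels = ["X", "X", "Y", "Y"]
model = MultinomialNB().fit(train_texts, train_labels)
print(model.predict("zaza zaza"), model.predict("kiki mumu kiki"))
```

Character n-grams help with the inconsistent spellings the abstract mentions, because two spellings of one word still share most of their substrings.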

artificial intelligence, dialect, machine learning, (11 more...)

arXiv.org Artificial Intelligence

doi: 10.5815/ijisa.2025.06.09

2512.04257

Country:

Asia > Malaysia (0.04)
Africa > Middle East > Libya > Sabha District > Sabha (0.04)
North America > United States > Michigan (0.04)
(7 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

A Very Big Fight Over a Very Small Language

The New Yorker, Dec-1-2025, 11:00:00 GMT

In the Swiss Alps, a plan to tidy up Romansh--spoken by less than one per cent of the country--set off a decades-long quarrel over identity, belonging, and the sound of authenticity. After reformers launched Rumantsch Grischun, a standardized version of Romansh's various dialects, traditionalists denounced it as a "bastard," a "castrated" tongue, an act of "linguistic murder." Ask him how it all began, and he remembers the ice. It was a bitter morning in January, 1982, when Bernard Cathomas, aged thirty-six, carefully picked his way up a slippery, sloping Zurich street. His destination was No. 33, an ochre house with green shutters--the home of Heinrich Schmid, a linguist at the University of Zurich. Inside, the décor suggested that "professor" was an encompassing identity: old wooden floors, a faded carpet, a living room seemingly untouched since the nineteen-thirties, when Schmid had grown up in the house. Schmid's wife served a Swiss carrot cake that manages bourgeois indulgence with a vegetable alibi. Cathomas had already written from Chur, in the canton of the Grisons, having recently become the general secretary of the Lia Rumantscha, a small association charged with protecting Switzerland's least known national language, Romansh. Spoken by less than one per cent of the Swiss population, the language was itself splintered into five major "idioms," not always readily intelligible to one another, each with its own spelling conventions. Earlier attempts at unification had collapsed in rivalries. In his letter, Cathomas said that Schmid's authority would be valuable in standardizing the language. Cathomas wrote in German but started and ended in his native Sursilvan, the biggest of the Romansh idioms: " ." Translation: "I thank you very much for your interest and attention to this problem."
Schmid, the man he was counting on, hadn't grown up speaking Romansh; he first learned it in high school, and later worked on the "Dicziunari Rumantsch Grischun," a Romansh dictionary begun in 1904 and still lumbering toward completion.

artificial intelligence, cathoma, natural language, (17 more...)

The New Yorker

Country:

Europe > Switzerland > Zürich > Zürich (0.45)
North America > United States > Texas (0.04)
North America > United States > New York (0.04)
(6 more...)

Genre: Personal (0.68)

Industry:

Media (0.93)
Government (0.93)
Education > Educational Setting > K-12 Education > Secondary School (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti

Prama, Tabia Tanzin, Danforth, Christopher M., Dodds, Peter Sheridan

arXiv.org Artificial Intelligence, Dec-1-2025

Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla ⇔ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2,260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: https://github.com/TabiaTanzin/LLMs-for-Low-Resource-Dialect-Translation-Using-Context-Aware-Prompting-A-Case-Study-on-Sylheti.git
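One way to picture a prompt that embeds a rulebook, a dictionary, and an authenticity check is the sketch below. The abstract does not give the actual Sylheti-CAP prompt wording, so the step phrasing, the function name `build_context_aware_prompt`, and the placeholder rules and glossary entries are all assumptions made for illustration. No model is called; the function only assembles the prompt text.

```python
from typing import Dict, List


def build_context_aware_prompt(
    sentence: str,
    rules: List[str],
    glossary: Dict[str, str],
    source_lang: str = "Bangla",
    target_lang: str = "Sylheti",
) -> str:
    """Assemble a three-part prompt: rules, matching dictionary entries, authenticity check."""
    tokens = set(sentence.split())
    # Keep only glossary entries whose source word occurs in the sentence.
    matches = [f"- {src} -> {tgt}" for src, tgt in glossary.items() if src in tokens]
    if not matches:
        matches = ["- (no matching entries)"]
    lines = [
        f"Translate the sentence from {source_lang} to {target_lang}.",
        "",
        "Step 1. Follow these linguistic rules:",
        *[f"- {rule}" for rule in rules],
        "",
        "Step 2. Prefer these dictionary entries:",
        *matches,
        "",
        f"Step 3. Check that the output reads as authentic {target_lang}; revise it if not.",
        "",
        f"Sentence: {sentence}",
    ]
    return "\n".join(lines)


# Placeholder inputs; these are not real Bangla or Sylheti rules or words.
example_rules = ["RULE_1 (placeholder)", "RULE_2 (placeholder)"]
example_glossary = {"word_a": "WORD_A", "word_b": "WORD_B", "word_c": "WORD_C"}
prompt = build_context_aware_prompt("word_a word_b word_x", example_rules, example_glossary)
print(prompt)
```

Filtering the glossary to entries that match the input keeps the prompt short, which is a design choice of this sketch rather than something the abstract states.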

large language model, machine learning, translation, (18 more...)

arXiv.org Artificial Intelligence

2511.21761

Country:

North America > United States > Vermont > Chittenden County > Burlington (0.14)
North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Using MLIR Transform to Design Sliced Convolution Algorithm

Ferrari, Victor, Pereira, Marcio, Alvarenga, Lucas, Leite, Gustavo, Araujo, Guido

arXiv.org Artificial Intelligence, Nov-25-2025

This paper proposes SConvTransform, a Transform dialect extension that provides operations for optimizing 2D convolutions in MLIR. Its main operation, SConvOp, lowers Linalg convolutions into tiled and packed generic operations through a fully declarative transformation pipeline. The process is guided by a Convolution Slicing Analysis that determines tile sizes and data layout strategies based on input and filter shapes, as well as target architecture parameters. SConvOp handles edge cases by splitting irregular regions and adjusting affine maps where needed. All packing and tiling operations are derived from a parametric set of affine equations, enabling reusable and analyzable transformations. Although functional correctness was the primary goal of this work, the experimental evaluation demonstrates the effectiveness of SConvTransform, achieving good enough performance across different target architectures. Future work will focus on optimizing performance and porting to other target devices. When applied to standard convolution configurations, the generated code achieves up to 60% of peak performance on ARM SME and 67% on Intel AVX512. These results validate the benefit of combining static shape analysis with structured tiling and packing strategies within the MLIR Transform dialect. Furthermore, the modular design of SConvTransform facilitates integration with future extensions, enabling continued optimization of convolution workloads through MLIR's extensible compilation infrastructure.

artificial intelligence, machine learning, opération, (18 more...)

arXiv.org Artificial Intelligence

2511.18222

Country:

South America > Brazil > São Paulo > Campinas (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre:

Research Report (0.64)
Workflow (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

How Well Do LLMs Understand Tunisian Arabic?

Mahdi, Mohamed

arXiv.org Artificial Intelligence, Nov-24-2025

Large Language Models (LLMs) are the engines driving today's AI agents. The better these models understand human languages, the more natural and user-friendly the interaction with AI becomes, from everyday devices like computers and smartwatches to any tool that can act intelligently. Yet, the ability of industrial-scale LLMs to comprehend low-resource languages, such as Tunisian Arabic (Tunizi), is often overlooked. This neglect risks excluding millions of Tunisians from fully interacting with AI in their own language, pushing them toward French or English. Such a shift not only threatens the preservation of the Tunisian dialect but may also create challenges for literacy and influence younger generations to favor foreign languages. In this study, we introduce a novel dataset containing parallel Tunizi, standard Tunisian Arabic, and English translations, along with sentiment labels. We benchmark several popular LLMs on three tasks: transliteration, translation, and sentiment analysis. Our results reveal significant differences between models, highlighting both their strengths and limitations in understanding and processing Tunisian dialects. By quantifying these gaps, this work underscores the importance of including low-resource languages in the next generation of AI systems, ensuring technology remains accessible, inclusive, and culturally grounded.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.16683

Country: Africa > Middle East > Tunisia > Tunis Governorate > Tunis (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
