AITopics | bavarian

Standard-to-Dialect Transfer Trends Differ across Text and Speech: A Case Study on Intent and Topic Classification in German Dialects

Blaschke, Verena, Winkler, Miriam, Plank, Barbara

arXiv.org Artificial Intelligence, Oct-10-2025

Research on cross-dialectal transfer from a standard to a non-standard dialect variety has typically focused on text data. However, dialects are primarily spoken, and non-standard spellings are known to cause issues in text processing. We compare standard-to-dialect transfer in three settings: text models, speech models, and cascaded systems where speech first gets automatically transcribed and then further processed by a text model. In our experiments, we focus on German and multiple German dialects in the context of written and spoken intent and topic classification. To that end, we release the first dialectal audio intent classification dataset. We find that the speech-only setup provides the best results on the dialect data while the text-only setup works best on the standard data. While the cascaded systems lag behind the text-only models for German, they perform relatively well on the dialectal data if the transcription system generates normalized, standard-like output.

computational linguistic, natural language, text classification, (20 more...)

2510.0789

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.84)

Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Hoffmann, Michael, John, Jophin, Schweter, Stefan, Ramakrishnan, Gokul, Mak, Hoi-Fong, Zhang, Alice, Gaynullin, Dmitry, Hammer, Nicolay J.

arXiv.org Artificial Intelligence, Sep-9-2025

Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.
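As a quick sanity check on the reported training mix, the sketch below adds up the rounded token counts quoted in the abstract and computes each language's share. The names `TOKENS` and `language_shares` are illustrative, and the figures are the rounded values from the abstract, not exact corpus statistics.

```python
# Rounded token counts as quoted in the abstract (not exact corpus sizes).
TOKENS = {
    "English": 82_000_000_000,
    "German": 82_000_000_000,
    "Bavarian": 80_000_000,
}


def language_shares(tokens):
    """Return each language's fraction of the total token count."""
    total = sum(tokens.values())
    return {lang: count / total for lang, count in tokens.items()}


if __name__ == "__main__":
    total = sum(TOKENS.values())
    print(f"total tokens: {total / 1e9:.2f}B")
    for lang, share in language_shares(TOKENS).items():
        print(f"{lang}: {share:.4%}")
```

Under these rounded figures the total comes to about 164.08B tokens, consistent with the stated 164B, and Bavarian accounts for roughly 0.05% of the mix.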

arxiv preprint arxiv, large language model, machine learning, (18 more...)

2509.05668

Country:

Europe > Germany (0.15)
North America > United States (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Energy (0.49)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Evaluating Pixel Language Models on Non-Standardized Languages

Muñoz-Ortiz, Alberto, Blaschke, Verena, Plank, Barbara

arXiv.org Artificial Intelligence, Dec-12-2024

We explore the potential of pixel-based models for transfer learning from standard languages to dialects. These models convert text into images that are divided into patches, enabling a continuous vocabulary representation that proves especially useful for out-of-vocabulary words common in dialectal data. Using German as a case study, we compare the performance of pixel-based models to token-based models across various syntactic and semantic tasks. Our results show that pixel-based models outperform token-based models in part-of-speech tagging, dependency parsing and intent detection for zero-shot dialect evaluation by up to 26 percentage points in some scenarios, though not in Standard German. However, pixel-based models fall short in topic classification. These findings emphasize the potential of pixel-based models for handling dialectal data, though further research should be conducted to assess their effectiveness in various linguistic contexts.

artificial intelligence, dialect, natural language, (15 more...)

2412.09084

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.05)
North America > Dominican Republic (0.04)
(10 more...)

Genre: Research Report > New Finding (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.89)

Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

Peng, Siyao, Sun, Zihang, Shan, Huangyan, Kolm, Marie, Blaschke, Verena, Artemova, Ekaterina, Plank, Barbara

arXiv.org Artificial Intelligence, Mar-19-2024

Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves on bar-wiki and moderately on bar-tweet. Inversely, training first on Bavarian contributes slightly to the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.

bavarian, dataset, proceedings, (14 more...)

2403.12749

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Europe > Austria (0.04)
(29 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank

Blaschke, Verena, Kovačić, Barbara, Peng, Siyao, Schütze, Hinrich, Plank, Barbara

arXiv.org Artificial Intelligence, Mar-15-2024

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack of 'within-language breadth': most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

bavarian, computational linguistic, treebank, (13 more...)

2403.10293

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Europe > Italy > Trentino-Alto Adige/Südtirol > South Tyrol (0.04)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
(24 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

MaiBaam Annotation Guidelines

Blaschke, Verena, Kovačić, Barbara, Peng, Siyao, Plank, Barbara

arXiv.org Artificial Intelligence, Mar-9-2024

This document provides annotation guidelines for MaiBaam, a Bavarian corpus annotated with part-of-speech (POS) tags and syntactic dependencies. MaiBaam belongs to the Universal Dependencies (UD) project (Zeman et al., 2023; de Marneffe et al., 2021), and our annotations elaborate on the general and German UD version 2 guidelines. This document is structured broadly in the order we prepare and annotate sentences: first, preprocessing and tokenization (§1), then general recaps of POS tags (§2) and dependencies (§3), before we go into annotation decisions that would also apply to German (§4) and lastly decisions that are specific to Bavarian grammar (§5). Many examples are written in German, since the standardized orthography makes it easier to search this PDF. We only annotate UD-style POS tags (UPOS tags) and dependencies and add the SpaceAfter=No feature where appropriate, but do not add any other information (no lemma, XPOS tags, morphological features, enhanced dependencies or miscellaneous annotations). This document is primarily directed at present and future annotators of MaiBaam. We publish it to additionally allow others working with MaiBaam or annotating similar data to better understand the decisions we have made.

guideline, pronoun, treebank, (13 more...)

2403.05902

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
Europe > Spain > Aragón (0.04)
(3 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations

Peng, Siyao, Sun, Zihang, Loftus, Sebastian, Plank, Barbara

arXiv.org Artificial Intelligence, Feb-2-2024

Named Entity Recognition (NER) is a key information extraction task with a long-standing tradition. While recent studies address and aim to correct annotation errors via re-labeling efforts, little is known about the sources of human label variation, such as text ambiguity, annotation error, or guideline divergence. This is especially the case for high-quality datasets and beyond English CoNLL03. This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian. We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions. We survey student annotations on a subset of difficult entities and substantiate the feasibility and necessity of manifold annotations for understanding named entity ambiguities from a distributional perspective.

annotation, computational linguistic, disagreement, (12 more...)

2402.01423

Country:

North America > United States > New York (0.05)
North America > Canada > Ontario > Toronto (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
(21 more...)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (0.94)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

Artemova, Ekaterina, Plank, Barbara

arXiv.org Artificial Intelligence, Apr-19-2023

Bilingual word lexicons are crucial tools for multilingual natural language understanding and machine translation tasks, as they facilitate the mapping of words in one language to their synonyms in another language. To achieve this, numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, using a typical pipeline consisting of two unsupervised steps: bitext mining and word alignment, both of which rely on pre-trained large language models (LLMs). In this paper, we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses several unique challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects. To evaluate the BLI outputs, we analyze them with respect to word frequency and pairwise edit distance. Additionally, we release two evaluation datasets comprising 1,500 bilingual sentence pairs and 1,000 bilingual word pairs. They were manually judged for their semantic similarity for each Bavarian-German and Alemannic-German language pair.
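The pairwise edit distance mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain Levenshtein distance (unit-cost insertions, deletions, and substitutions); the abstract does not say which variant the authors use. The function names and the sample word pairs are illustrative and are not taken from the released datasets.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (0.0 for two empty strings)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Hypothetical dialect/standard pairs, for illustration only.
    pairs = [("ned", "nicht"), ("i", "ich"), ("Haus", "Haus")]
    for dialect_word, standard_word in pairs:
        d = edit_distance(dialect_word, standard_word)
        n = normalized_distance(dialect_word, standard_word)
        print(f"{dialect_word!r} vs {standard_word!r}: distance={d}, normalized={n:.2f}")
```

Normalizing by the longer word's length is one common way to compare pairs of different lengths. It is a design choice made for this sketch and is not described in the abstract.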

computational linguistic, large language model, natural language, (18 more...)

2304.09957

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(19 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
