Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data
Siyao Peng, Zihang Sun, Huangyan Shan, Marie Kolm, Verena Blaschke, Ekaterina Artemova, Barbara Plank
arXiv.org Artificial Intelligence
Named Entity Recognition (NER) is a fundamental task for extracting key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs from standard German in lexical distribution, syntactic construction, and entity information. We conduct in-domain, cross-domain, sequential, and joint experiments on two Bavarian and three German corpora and present the first comprehensive NER results on Bavarian. Incorporating knowledge from the larger German NER (sub-)datasets notably improves performance on bar-wiki and moderately improves it on bar-tweet. Conversely, training first on Bavarian yields a slight gain on the seminal German CoNLL 2006 corpus. Moreover, with gold dialect labels on Bavarian tweets, we assess multi-task learning between five NER and two Bavarian-German dialect identification tasks and achieve NER SOTA on bar-wiki. We substantiate the necessity of our low-resource BarNER corpus and the importance of diversity in dialects, genres, and topics in enhancing model performance.
Mar-19-2024
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Information Technology (0.46)
- Technology: