AITopics | universal dependency

Collaborating Authors

universal dependency

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis

Klemen, Matej, Arčon, Tjaša, Terčon, Luka, Robnik-Šikonja, Marko, Dobrovoljc, Kaja

arXiv.org Artificial IntelligenceDec-2-2025

Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions - dominant-order accuracy, order-coverage completeness, and distributional fidelity - which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.

accuracy, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2512.00214

Country: Europe > Slovenia (0.15)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Evaluating Morphological Alignment of Tokenizers in 70 Languages

Arnett, Catherine, Hudspeth, Marisa, O'Connor, Brendan

arXiv.org Artificial IntelligenceJul-10-2025

While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to effectively evaluate tokenizer quality. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained languages models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment does not explain very much variance in model performance, suggesting that morphological alignment alone does not measure dimensions of tokenization quality relevant to model performance.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.06378

Country:

Europe (1.00)
North America > United States (0.92)
Asia > Middle East > UAE (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.46)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.35)

Add feedback

Building UD Cairo for Old English in the Classroom

Levine, Lauren, Min, Junghyun, Zeldes, Amir

arXiv.org Artificial IntelligenceJun-13-2025

In this paper we present a sample treebank for Old English based on the UD Cairo sentences, collected and annotated as part of a classroom curriculum in Historical Linguistics. To collect the data, a sample of 20 sentences illustrating a range of syntactic constructions in the world's languages, we employ a combination of LLM prompting and searches in authentic Old English data. For annotation we assigned sentences to multiple students with limited prior exposure to UD, whose annotations we compare and adjudicate. Our results suggest that while current LLM outputs in Old English do not reflect authentic syntax, this can be mitigated by post-editing, and that although beginner annotators do not possess enough background to complete the task perfectly, taken together they can produce good results and learn from the experience. We also conduct preliminary parsing experiments using Modern English training data, and find that although performance on Old English is poor, parsing on annotated features (lemma, hyperlemma, gloss) leads to improved performance.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.18718

Country:

North America > United States (1.00)
Europe (1.00)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.65)

Genre: Research Report > New Finding (0.68)

Industry: Education > Educational Setting (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.57)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.55)

Add feedback

Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages

Kellert, Olga, Tyagi, Nemika, Imran, Muhammad, Licona-Guevara, Nelvin, Gómez-Rodríguez, Carlos

arXiv.org Artificial IntelligenceJun-10-2025

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject

annotation, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.07274

Country:

North America > United States (1.00)
Europe (0.93)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

A UD Treebank for Bohairic Coptic

Zeldes, Amir, Speransky, Nina, Wagner, Nicholas, Schroeder, Caroline T.

arXiv.org Artificial IntelligenceJun-10-2025

Despite recent advances in digital resources for other Coptic dialects, especially Sahidic, Bohairic Coptic, the main Coptic dialect for pre-Mamluk, late Byzantine Egypt, and the contemporary language of the Coptic Church, remains critically under-resourced. This paper presents and evaluates the first syntactically annotated corpus of Bohairic Coptic, sampling data from a range of works, including Biblical text, saints' lives and Christian ascetic writing. We also explore some of the main differences we observe compared to the existing UD treebank of Sahidic Coptic, the classical dialect of the language, and conduct joint and cross-dialect parsing experiments, revealing the unique nature of Bohairic as a related, but distinct variety from the more often studied Sahidic.

artificial intelligence, natural language, treebank, (17 more...)

arXiv.org Artificial Intelligence

2504.18386

Country:

North America > United States (0.68)
Europe > Germany (0.68)
Africa > Middle East > Egypt (0.49)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

Evaluating the Effectiveness of Linguistic Knowledge in Pretrained Language Models: A Case Study of Universal Dependencies

Li, Wenxi

arXiv.org Artificial IntelligenceJun-6-2025

Universal Dependencies (UD), while widely regarded as the most successful linguistic framework for cross-lingual syntactic representation, remains underexplored in terms of its effectiveness. This paper addresses this gap by integrating UD into pretrained language models and assesses if UD can improve their performance on a cross-lingual adversarial paraphrase identification task. Experimental results show that incorporation of UD yields significant improvements in accuracy and $F_1$ scores, with average gains of 3.85\% and 6.08\% respectively. These enhancements reduce the performance gap between pretrained models and large language models in some language pairs, and even outperform the latter in some others. Furthermore, the UD-based similarity score between a given language and English is positively correlated to the performance of models in that language. Both findings highlight the validity and potential of UD in out-of-domain tasks.

computational linguistic, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.04887

Country:

Europe (1.00)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)

Add feedback

The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

Aquino, Angelina A., Miranda, Lester James V., Or, Elsie Marie T.

arXiv.org Artificial IntelligenceMay-28-2025

This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2505.20428

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States > Minnesota (0.28)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.93)
Leisure & Entertainment > Sports > Basketball (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Enhancing Korean Dependency Parsing with Morphosyntactic Features

Park, Jungyeul, Chen, Yige, Kim, Kyuwon, Lim, KyungTae, Park, Chulwoo

arXiv.org Artificial IntelligenceMar-26-2025

This paper introduces UniDive for Korean, an integrated framework that bridges Universal Dependencies (UD) and Universal Morphology (UniMorph) to enhance the representation and processing of Korean {morphosyntax}. Korean's rich inflectional morphology and flexible word order pose challenges for existing frameworks, which often treat morphology and syntax separately, leading to inconsistencies in linguistic analysis. UniDive unifies syntactic and morphological annotations by preserving syntactic dependencies while incorporating UniMorph-derived features, improving consistency in annotation. We construct an integrated dataset and apply it to dependency parsing, demonstrating that enriched morphosyntactic features enhance parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Our experiments, conducted with both encoder-only and decoder-only models, confirm that explicit morphological information contributes to more accurate syntactic analysis.

annotation, artificial intelligence, natural language, (16 more...)

arXiv.org Artificial Intelligence

2503.21029

Country:

Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
Asia > China > Hong Kong (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(10 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Add feedback

Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement

Sung, Hakyung, Shin, Gyu-Ho

arXiv.org Artificial IntelligenceMar-18-2025

We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.14718

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
North America > United States > Oregon (0.04)
(4 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Syntactic Transfer to Kyrgyz Using the Treebank Translation Method

Alekseev, Anton, Tillabaeva, Alina, Kabaeva, Gulnara Dzh., Nikolenko, Sergey I.

arXiv.org Artificial IntelligenceDec-17-2024

The Kyrgyz language, as a low-resource language, requires significant effort to create high-quality syntactic corpora. This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz. We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method. The effectiveness of the proposed tool was evaluated using the TueCL treebank. The results demonstrate that this approach achieves higher syntactic annotation accuracy compared to a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manual annotation for the resulting syntactic trees, contributing to further optimization of the annotation process.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2412.13146

Country:

Asia > Russia (0.05)
Europe > Russia > Northwestern Federal District > Leningrad Oblast > Saint Petersburg (0.04)
Asia > Kyrgyzstan > Chüy Region > Bishkek (0.04)
(8 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)

Add feedback