AITopics | Grammars & Parsing

Collaborating Authors

Grammars & Parsing

News Overviews Instructional Materials AI-Alerts Classics

Citation Parsing and Analysis with Language Models

arXiv.org Artificial IntelligenceMay-23-2025

A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

artificial intelligence, large language model, natural language, (15 more...)

arXiv.org Artificial Intelligence

2505.15948

Country: North America > United States > California (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (0.97)
Government > Regional Government > North America Government > United States Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.62)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

Add feedback

Transfer of Structural Knowledge from Synthetic Languages

Budnikov, Mikhail, Yamshchikov, Ivan

arXiv.org Artificial IntelligenceMay-22-2025

This work explores transfer learning from several synthetic languages to English. We investigate the structure of the embeddings in the fine-tuned models, the information they contain, and the capabilities of the fine-tuned models on simple linguistic tasks. We also introduce a new synthetic language that leads to better transfer to English than the languages used in previous research. Finally, we introduce Tiny-Cloze Benchmark - a new synthetic benchmark for natural language understanding that is more informative for less powerful models. We use Tiny-Cloze Benchmark to evaluate fine-tuned models in several domains demonstrating that fine-tuning on a new synthetic language allows for better performance on a variety of tasks.

large language model, machine learning, synthetic language, (17 more...)

arXiv.org Artificial Intelligence

2505.15769

Country: Europe > Germany (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Semantic-based Unsupervised Framing Analysis (SUFA): A Novel Approach for Computational Framing Analysis

Ali, Mohammad, Hassan, Naeemul

arXiv.org Artificial IntelligenceMay-22-2025

This research presents a novel approach to computational framing analysis, called Semantic Relations-based Unsupervised Framing Analysis (SUFA). SUFA leverages semantic relations and dependency parsing algorithms to identify and assess entity-centric emphasis frames in news media reports. This innovative method is derived from two studies -- qualitative and computational -- using a dataset related to gun violence, demonstrating its potential for analyzing entity-centric emphasis frames. This article discusses SUFA's strengths, limitations, and application procedures. Overall, the SUFA approach offers a significant methodological advancement in computational framing analysis, with its broad applicability across both the social sciences and computational domains.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2505.15563

Country: North America > United States > Texas (0.32)

Genre:

Research Report > Experimental Study (0.93)
Research Report > Promising Solution (0.90)
Research Report > New Finding (0.68)
Overview > Innovation (0.61)

Industry:

Media > News (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.91)

Add feedback

Neural Morphological Tagging for Nguni Languages

Marquard, Cael, Mawere, Simbarashe, Meyer, Francois

arXiv.org Artificial IntelligenceMay-20-2025

Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.

machine learning, natural language, segmentation, (19 more...)

arXiv.org Artificial Intelligence

2505.12949

Country:

Europe (1.00)
Africa (1.00)
North America > United States > California (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Neuro-Symbolic Query Compiler

Zhang, Yuyao, Dou, Zhicheng, Li, Xiaoxi, Jin, Jiajie, Wu, Yongkang, Li, Zhonghua, Ye, Qi, Wen, Ji-Rong

arXiv.org Artificial IntelligenceMay-20-2025

Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.11932

Country: North America (0.15)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.89)
(2 more...)

Add feedback

Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based Encoding

Kang, Jeongwoo, Coavoux, Maximin, Lopez, Cédric, Schwab, Didier

arXiv.org Artificial IntelligenceMay-14-2025

Sequence-to-sequence models are widely used to train Abstract Meaning Representation (Banarescu et al., 2013, AMR) parsers. To train such models, AMR graphs have to be linearized into a one-line text format. While Penman encoding is typically used for this purpose, we argue that it has limitations: (1) for deep graphs, some closely related nodes are located far apart in the linearized text (2) Penman's tree-based encoding necessitates inverse roles to handle node re-entrancy, doubling the number of relation types to predict. To address these issues, we propose a triple-based linearization method and compare its efficiency with Penman linearization. Although triples are well suited to represent a graph, our results suggest room for improvement in triple encoding to better compete with Penman's concise and explicit representation of a nested graph structure.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2505.08504

Country:

Asia > China (0.06)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.05)
Europe > Bulgaria > Sofia City Province > Sofia (0.05)
(7 more...)

Genre: Research Report > New Finding (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Constraint-Based Reasoning (0.61)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.53)

Add feedback

Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition

Kiruluta, Andrew, Lundy, Eric, Burity, Priscilla

arXiv.org Artificial IntelligenceMay-14-2025

We introduce the Graph W avelet Transformer (GWT), a novel architecture that replaces this bottleneck with a learnable, multi-scale wavelet transform defined over an explicit graph Laplacian derived from syntactic or semantic parses. By parameterizing K N bandpass filters in the graph Fourier domain, GWT achieves a linear-time mixing operator that simultaneously captures local syntactic dependencies and global semantic context. We provide a rigorous mathematical formulation of the spectral filtering and mixing process, integrate GWT modules into a standard Graph Transformer backbone, and evaluate on the WMT14 English-German translation benchmark. Empirical results demonstrate that GWT outperforms the baseline Graph Transformer by 0.8 BLEU, reduces parameter count by 7 %, and speeds up inference by 15 %. Our analysis shows that multi-scale spectral decomposition offers an interpretable, efficient, and expressive alternative to quadratic self-attention for graph-structured sequence modeling.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2505.07862

Country: North America > United States (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Learning curves theory for hierarchically compositional data with power-law distributed features

Cagnetta, Francesco, Kang, Hyunmo, Wyart, Matthieu

arXiv.org Machine LearningMay-13-2025

Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Machine Learning

2505.07067

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > Maryland > Baltimore (0.04)
North America > Canada (0.04)
(3 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.81)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.81)

Add feedback

Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

Cagnetta, Francesco, Favero, Alessandro, Sclocchi, Antonio, Wyart, Matthieu

arXiv.org Machine LearningMay-13-2025

How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.

correlation, machine learning, natural language, (18 more...)

arXiv.org Machine Learning

2505.0707

Country:

Europe > Switzerland > Vaud > Lausanne (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(5 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

Zaitova, Iuliia, Hirak, Vitalii, Abdullah, Badr M., Klakow, Dietrich, Möbius, Bernd, Avgustinova, Tania

arXiv.org Artificial IntelligenceMay-12-2025

This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages - English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.

computational linguistic, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.06062

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.95)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback