Grammars & Parsing


Citation Parsing and Analysis with Language Models

arXiv.org Artificial Intelligence

A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.


Transfer of Structural Knowledge from Synthetic Languages

arXiv.org Artificial Intelligence

This work explores transfer learning from several synthetic languages to English. We investigate the structure of the embeddings in the fine-tuned models, the information they contain, and the capabilities of the fine-tuned models on simple linguistic tasks. We also introduce a new synthetic language that leads to better transfer to English than the languages used in previous research. Finally, we introduce the Tiny-Cloze Benchmark, a new synthetic benchmark for natural language understanding that is more informative for less powerful models. We use the Tiny-Cloze Benchmark to evaluate fine-tuned models in several domains, demonstrating that fine-tuning on a new synthetic language allows for better performance on a variety of tasks.


Semantic-based Unsupervised Framing Analysis (SUFA): A Novel Approach for Computational Framing Analysis

arXiv.org Artificial Intelligence

This research presents a novel approach to computational framing analysis, called Semantic Relations-based Unsupervised Framing Analysis (SUFA). SUFA leverages semantic relations and dependency parsing algorithms to identify and assess entity-centric emphasis frames in news media reports. This innovative method is derived from two studies -- qualitative and computational -- using a dataset related to gun violence, demonstrating its potential for analyzing entity-centric emphasis frames. This article discusses SUFA's strengths, limitations, and application procedures. Overall, the SUFA approach offers a significant methodological advancement in computational framing analysis, with its broad applicability across both the social sciences and computational domains.


Neural Morphological Tagging for Nguni Languages

arXiv.org Artificial Intelligence

Morphological parsing is the task of decomposing words into morphemes, the smallest units of meaning in a language, and labelling their grammatical roles. It is a particularly challenging task for agglutinative languages, such as the Nguni languages of South Africa, which construct words by concatenating multiple morphemes. A morphological parsing system can be framed as a pipeline with two separate components, a segmenter followed by a tagger. This paper investigates the use of neural methods to build morphological taggers for the four Nguni languages. We compare two classes of approaches: training neural sequence labellers (LSTMs and neural CRFs) from scratch and finetuning pretrained language models. We compare performance across these two categories, as well as to a traditional rule-based morphological parser. Neural taggers comfortably outperform the rule-based baseline and models trained from scratch tend to outperform pretrained models. We also compare parsing results across different upstream segmenters and with varying linguistic input features. Our findings confirm the viability of employing neural taggers based on pre-existing morphological segmenters for the Nguni languages.
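
The two-stage pipeline described above (a segmenter followed by a tagger) can be sketched roughly as follows. This is only an illustration: the lookup tables `TOY_SEGMENTS` and `TOY_TAGS`, the simplified tag names, and the function names are all assumptions standing in for a real upstream segmenter and a neural tagger, and they are not the paper's models or tagset.

```python
from typing import List, Tuple

# Hypothetical toy segmentations standing in for an upstream segmenter.
# "ngiyabonga" (isiZulu, roughly "I thank") is used purely as an illustration.
TOY_SEGMENTS = {"ngiyabonga": ["ngi", "ya", "bong", "a"]}

# Hypothetical, simplified tag names; not the tagset used in the paper.
TOY_TAGS = {
    "ngi": "SubjectConcord",
    "ya": "TenseMarker",
    "bong": "VerbRoot",
    "a": "FinalVowel",
}


def segment(word: str) -> List[str]:
    """Stage 1: split a word into morphemes (a lookup stands in for a segmenter)."""
    return TOY_SEGMENTS.get(word, [word])


def tag(morphemes: List[str]) -> List[Tuple[str, str]]:
    """Stage 2: label each morpheme (a lookup stands in for a neural tagger)."""
    return [(m, TOY_TAGS.get(m, "UNK")) for m in morphemes]


def parse(word: str) -> List[Tuple[str, str]]:
    """Run the pipeline: the tagger only ever sees the segmenter's output."""
    return tag(segment(word))


if __name__ == "__main__":
    for morpheme, label in parse("ngiyabonga"):
        print(f"{morpheme}\t{label}")
```

Because the tagger consumes whatever the segmenter emits, segmentation errors propagate downstream, which is presumably why the abstract compares results across different upstream segmenters.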


Neuro-Symbolic Query Compiler

arXiv.org Artificial Intelligence

Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar $G[q]$ to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
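
To make the compile-to-AST idea concrete, here is a minimal recursive-descent sketch. The grammar below is a hypothetical toy grammar invented for illustration; it is not the paper's $G[q]$, and the names `tokenize`, `Parser`, `compile_query`, and `leaves` are likewise assumptions rather than QCompiler components.

```python
import re
from typing import Any, List, Optional

# Hypothetical toy grammar (NOT the paper's G[q]):
#   query  ::= term   { "OR"  term }
#   term   ::= factor { "AND" factor }
#   factor ::= ATOM | "(" query ")"
#   ATOM   ::= a double-quoted atomic sub-query
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[A-Za-z]+')


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


class Parser:
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected: Optional[str] = None) -> str:
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def parse_query(self) -> Any:
        node = self.parse_term()
        while self.peek() == "OR":
            self.eat("OR")
            node = ("OR", node, self.parse_term())
        return node

    def parse_term(self) -> Any:
        node = self.parse_factor()
        while self.peek() == "AND":
            self.eat("AND")
            node = ("AND", node, self.parse_factor())
        return node

    def parse_factor(self) -> Any:
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            node = self.parse_query()
            self.eat(")")
            return node
        if tok is not None and tok.startswith('"'):
            return self.eat().strip('"')  # leaf: an atomic sub-query
        raise SyntaxError(f"unexpected token {tok!r}")


def compile_query(text: str) -> Any:
    """Parse a query string into a nested-tuple AST."""
    parser = Parser(tokenize(text))
    ast = parser.parse_query()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing token {parser.peek()!r}")
    return ast


def leaves(ast: Any) -> List[str]:
    """Collect the atomic sub-queries stored in the leaf nodes."""
    if isinstance(ast, str):
        return [ast]
    _, left, right = ast
    return leaves(left) + leaves(right)
```

For example, `compile_query('"a" AND ("b" OR "c")')` yields the nested tuple `("AND", "a", ("OR", "b", "c"))`, and `leaves` of that tree returns the three atomic sub-queries that a retriever could then handle one at a time.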


Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based Encoding

arXiv.org Artificial Intelligence

Sequence-to-sequence models are widely used to train Abstract Meaning Representation (AMR; Banarescu et al., 2013) parsers. To train such models, AMR graphs have to be linearized into a one-line text format. While Penman encoding is typically used for this purpose, we argue that it has limitations: (1) for deep graphs, some closely related nodes are located far apart in the linearized text, and (2) Penman's tree-based encoding necessitates inverse roles to handle node re-entrancy, doubling the number of relation types to predict. To address these issues, we propose a triple-based linearization method and compare its efficiency with Penman linearization. Although triples are well suited to represent a graph, our results suggest room for improvement in triple encoding to better compete with Penman's concise and explicit representation of a nested graph structure.
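
The contrast between a nested encoding and a flat triple encoding can be illustrated with a small sketch. The nested-tuple input format, the `" | "` separator, and the function names are assumptions made for this example; the abstract does not specify the exact triple encoding used in the paper.

```python
from typing import List, Tuple, Union

# A node is (variable, concept, edges). Each edge is (role, child), where the
# child is either another node or a bare variable name (a re-entrant reference).
Node = Tuple[str, str, list]
Triple = Tuple[str, str, str]


def to_triples(node: Node) -> List[Triple]:
    """Flatten a nested, Penman-like tree into (source, role, target) triples."""
    var, concept, edges = node
    triples: List[Triple] = [(var, ":instance", concept)]
    for role, child in edges:
        if isinstance(child, str):
            # Re-entrancy: just point at the existing variable.
            triples.append((var, role, child))
        else:
            triples.append((var, role, child[0]))
            triples.extend(to_triples(child))
    return triples


def linearize(triples: List[Triple]) -> str:
    """Join triples into one line; the ' | ' separator is a hypothetical choice."""
    return " | ".join(" ".join(t) for t in triples)


# "The boy wants to go", nested roughly as in Penman notation:
# (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))
AMR: Node = (
    "w",
    "want-01",
    [
        (":ARG0", ("b", "boy", [])),
        (":ARG1", ("g", "go-01", [(":ARG0", "b")])),
    ],
)

if __name__ == "__main__":
    print(linearize(to_triples(AMR)))
```

In the triple form, every edge names both endpoints explicitly, so a re-entrant node such as `b` is simply another triple. The cost is that variables are repeated in each triple, which makes the output longer than the nested form.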


Graph Laplacian Wavelet Transformer via Learnable Spectral Decomposition

arXiv.org Artificial Intelligence

We introduce the Graph Wavelet Transformer (GWT), a novel architecture that replaces the quadratic self-attention bottleneck with a learnable, multi-scale wavelet transform defined over an explicit graph Laplacian derived from syntactic or semantic parses. By parameterizing K N bandpass filters in the graph Fourier domain, GWT achieves a linear-time mixing operator that simultaneously captures local syntactic dependencies and global semantic context. We provide a rigorous mathematical formulation of the spectral filtering and mixing process, integrate GWT modules into a standard Graph Transformer backbone, and evaluate on the WMT14 English-German translation benchmark. Empirical results demonstrate that GWT outperforms the baseline Graph Transformer by 0.8 BLEU, reduces parameter count by 7%, and speeds up inference by 15%. Our analysis shows that multi-scale spectral decomposition offers an interpretable, efficient, and expressive alternative to quadratic self-attention for graph-structured sequence modeling.
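
For orientation, the textbook form of spectral filtering on a graph is sketched below. This is the standard graph signal processing formulation, not the paper's own parameterization, which this abstract does not give; the symbols $A$, $D$, $U$, $\Lambda$, $g_{\theta_k}$, and $X$ are notation assumed for this sketch.

```latex
% Combinatorial graph Laplacian and its eigendecomposition
L = D - A = U \Lambda U^{\top}

% Graph Fourier transform of node features X, and its inverse
\hat{X} = U^{\top} X, \qquad X = U \hat{X}

% A bank of K learnable bandpass filters g_{\theta_k}, applied in the
% spectral domain and summed to mix information across scales
Y = \sum_{k=1}^{K} U \, g_{\theta_k}(\Lambda) \, U^{\top} X
```

Each $g_{\theta_k}$ acts on the eigenvalues only, so a filter concentrated on small eigenvalues passes smooth, graph-wide structure while one concentrated on large eigenvalues passes sharp local variation. Computing $U$ exactly is not linear-time in general, and the abstract does not say how GWT obtains its linear-time operator.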


Learning curves theory for hierarchically compositional data with power-law distributed features

arXiv.org Machine Learning

Recent theories suggest that Neural Scaling Laws arise whenever the task is linearly decomposed into power-law distributed units. Alternatively, scaling laws also emerge when data exhibit a hierarchically compositional structure, as is thought to occur in language and images. To unify these views, we consider classification and next-token prediction tasks based on probabilistic context-free grammars -- probabilistic models that generate data via a hierarchy of production rules. For classification, we show that having power-law distributed production rules results in a power-law learning curve with an exponent depending on the rules' distribution and a large multiplicative constant that depends on the hierarchical structure. By contrast, for next-token prediction, the distribution of production rules controls the local details of the learning curve, but not the exponent describing the large-scale behaviour.
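
As a rough illustration of the setup, the sketch below samples strings from a toy probabilistic context-free grammar whose production rules follow a Zipf-like power law. The grammar `RULES`, the exponent `alpha`, and the function names are assumptions made for this example and are not taken from the paper.

```python
import random
from typing import Dict, List, Tuple


def zipf_weights(n: int, alpha: float) -> List[float]:
    """Normalized power-law weights: the rule of rank k gets weight ~ k ** -alpha."""
    raw = [1.0 / (k ** alpha) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical two-level toy grammar. Each nonterminal maps to its candidate
# right-hand sides, listed in rank order (most probable first).
RULES: Dict[str, List[Tuple[str, str]]] = {
    "S": [("A", "B"), ("B", "A"), ("A", "A")],
    "A": [("a", "b"), ("b", "a"), ("a", "a")],
    "B": [("b", "b"), ("a", "b"), ("b", "a")],
}


def sample(symbol: str, rng: random.Random, alpha: float = 1.0) -> List[str]:
    """Expand `symbol` top-down, choosing each production by power-law weight."""
    if symbol not in RULES:
        return [symbol]  # terminal symbol
    options = RULES[symbol]
    weights = zipf_weights(len(options), alpha)
    rhs = rng.choices(options, weights=weights, k=1)[0]
    out: List[str] = []
    for child in rhs:
        out.extend(sample(child, rng, alpha))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print("".join(sample("S", rng)))
```

Larger values of `alpha` concentrate probability on the top-ranked rules, so low-rank productions are seen rarely in a finite training set. That rarity is the kind of effect a learning-curve theory for such data has to account for.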


Scaling Laws and Representation Learning in Simple Hierarchical Languages: Transformers vs. Convolutional Architectures

arXiv.org Machine Learning

How do neural language models acquire a language's structure when trained for next-token prediction? We address this question by deriving theoretical scaling laws for neural network performance on synthetic datasets generated by the Random Hierarchy Model (RHM) -- an ensemble of probabilistic context-free grammars designed to capture the hierarchical structure of natural language while remaining analytically tractable. Previously, we developed a theory of representation learning based on data correlations that explains how deep learning models capture the hierarchical structure of the data sequentially, one layer at a time. Here, we extend our theoretical framework to account for architectural differences. In particular, we predict and empirically validate that convolutional networks, whose structure aligns with that of the generative process through locality and weight sharing, enjoy a faster scaling of performance compared to transformer models, which rely on global self-attention mechanisms. This finding clarifies the architectural biases underlying neural scaling laws and highlights how representation learning is shaped by the interaction between model architecture and the statistical properties of data.


Attention on Multiword Expressions: A Multilingual Study of BERT-based Models with Regard to Idiomaticity and Microsyntax

arXiv.org Artificial Intelligence

This study analyzes the attention patterns of fine-tuned encoder-only models based on the BERT architecture (BERT-based models) towards two distinct types of Multiword Expressions (MWEs): idioms and microsyntactic units (MSUs). Idioms present challenges in semantic non-compositionality, whereas MSUs demonstrate unconventional syntactic behavior that does not conform to standard grammatical categorizations. We aim to understand whether fine-tuning BERT-based models on specific tasks influences their attention to MWEs, and how this attention differs between semantic and syntactic tasks. We examine attention scores to MWEs in both pre-trained and fine-tuned BERT-based models. We utilize monolingual models and datasets in six Indo-European languages: English, German, Dutch, Polish, Russian, and Ukrainian. Our results show that fine-tuning significantly influences how models allocate attention to MWEs. Specifically, models fine-tuned on semantic tasks tend to distribute attention to idiomatic expressions more evenly across layers. Models fine-tuned on syntactic tasks show an increase in attention to MSUs in the lower layers, corresponding with syntactic processing requirements.
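
One simple way to quantify "attention to an MWE" is to sum, for each query token, the attention weight that falls on the MWE's token positions and then average over queries. The sketch below implements that. It is an assumed, illustrative metric: the abstract does not define the paper's exact attention score, and the function name and toy matrix are hypothetical.

```python
from typing import Iterable, List


def attention_to_span(attn: List[List[float]], span: Iterable[int]) -> float:
    """Mean attention mass that query tokens assign to the positions in `span`.

    `attn[i][j]` is the weight token i places on token j; each row is assumed
    to sum to 1, as with a softmax-normalized attention head.
    """
    positions = set(span)
    per_query = [
        sum(weight for j, weight in enumerate(row) if j in positions)
        for row in attn
    ]
    return sum(per_query) / len(per_query)


if __name__ == "__main__":
    # Toy 3-token attention matrix for one head; tokens 1 and 2 form the MWE.
    toy_attn = [
        [0.50, 0.25, 0.25],
        [0.25, 0.50, 0.25],
        [0.25, 0.25, 0.50],
    ]
    print(attention_to_span(toy_attn, [1, 2]))
```

Computing this value per layer, once for the pre-trained model and once for the fine-tuned one, would give the kind of layer-wise comparison the abstract describes.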