Goto

Collaborating Authors

 Grammars & Parsing


Evaluating Compositional Approaches for Focus and Sentiment Analysis

arXiv.org Artificial Intelligence

While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics that deal with linguistic expressions representing focus or emphasis such as "it was John who left". We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach VADER that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.


SASST: Leveraging Syntax-Aware Chunking and LLMs for Simultaneous Speech Translation

arXiv.org Artificial Intelligence

This work proposes a grammar-based chunking strategy that segments input streams into semantically complete units by parsing dependency relations (e.g., noun phrase boundaries, verb-object structures) and punctuation features. The method ensures chunk coherence and minimizes semantic fragmentation. Building on this mechanism, we present SASST (Syntax-Aware Simultaneous Speech Translation), an end-to-end framework integrating frozen Whisper encoder and decoder-only LLM. The unified architecture dynamically outputs translation tokens or symbols to jointly optimize translation timing and content, with target-side reordering addressing word-order divergence. Experiments on CoVoST2 multilingual corpus En-{De, Zh, Ja} demonstrate significant translation quality improvements across languages and validate the effectiveness of syntactic structures in LLM-driven SimulST systems.


Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions

arXiv.org Artificial Intelligence

Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstones digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) for integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model's robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.


Probing Syntax in Large Language Models: Successes and Remaining Challenges

arXiv.org Artificial Intelligence

The syntactic structures of sentences can be readily read-out from the activations of large language models (LLMs). However, the ``structural probes'' that have been developed to reveal this phenomenon are typically evaluated on an indiscriminate set of sentences. Consequently, it remains unclear whether structural and/or statistical factors systematically affect these syntactic representations. To address this issue, we conduct an in-depth analysis of structural probes on three controlled benchmarks. Our results are three-fold. First, structural probes are biased by a superficial property: the closer two words are in a sentence, the more likely structural probes will consider them as syntactically linked. Second, structural probes are challenged by linguistic properties: they poorly represent deep syntactic structures, and get interfered by interacting nouns or ungrammatical verb forms. Third, structural probes do not appear to be affected by the predictability of individual words. Overall, this work sheds light on the current challenges faced by structural probes. Providing a benchmark made of controlled stimuli to better evaluate their performance.


Analysing Temporal Reasoning in Description Logics Using Formal Grammars

arXiv.org Artificial Intelligence

We establish a correspondence between (fragments of) $\mathcal{TEL}^\bigcirc$, a temporal extension of the $\mathcal{EL}$ description logic with the LTL operator $\bigcirc^k$, and some specific kinds of formal grammars, in particular, conjunctive grammars (context-free grammars equipped with the operation of intersection). This connection implies that $\mathcal{TEL}^\bigcirc$ does not possess the property of ultimate periodicity of models, and further leads to undecidability of query answering in $\mathcal{TEL}^\bigcirc$, closing a question left open since the introduction of $\mathcal{TEL}^\bigcirc$. Moreover, it also allows to establish decidability of query answering for some new interesting fragments of $\mathcal{TEL}^\bigcirc$, and to reuse for this purpose existing tools and algorithms for conjunctive grammars.


Exploring In-Context Learning for Frame-Semantic Parsing

arXiv.org Artificial Intelligence

Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.


Code-Switching and Syntax: A Large-Scale Experiment

arXiv.org Artificial Intelligence

The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.


When Does Metadata Conditioning (NOT) Work for Language Model Pre-Training? A Study with Context-Free Grammars

arXiv.org Artificial Intelligence

The ability to acquire latent semantics is one of the key properties that determines the performance of language models. One convenient approach to invoke this ability is to prepend metadata (e.g. URLs, domains, and styles) at the beginning of texts in the pre-training data, making it easier for the model to access latent semantics before observing the entire text. Previous studies have reported that this technique actually improves the performance of trained models in downstream tasks; however, this improvement has been observed only in specific downstream tasks, without consistent enhancement in average next-token prediction loss. To understand this phenomenon, we closely investigate how prepending metadata during pre-training affects model performance by examining its behavior using artificial data. Interestingly, we found that this approach produces both positive and negative effects on the downstream tasks. We demonstrate that the effectiveness of the approach depends on whether latent semantics can be inferred from the downstream task's prompt. Specifically, through investigations using data generated by probabilistic context-free grammars, we show that training with metadata helps improve model's performance when the given context is long enough to infer the latent semantics. In contrast, the technique negatively impacts performance when the context lacks the necessary information to make an accurate posterior inference.


Data Augmentation for Spoken Grammatical Error Correction

arXiv.org Artificial Intelligence

While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S\&I Corpus, the first publicly available speech dataset with grammar error annotations.


A Diagrammatic Calculus for a Functional Model of Natural Language Semantics

arXiv.org Artificial Intelligence

In this paper, we study a functional programming approach to natural language semantics, allowing us to increase the expressiveness of a more traditional denotation style. We will formalize a category based type and effect system to represent the semantic difference between syntactically equivalent expressions. We then construct a diagrammatic calculus to model parsing and handling of effects, providing a method to efficiently compute the denotations for sentences.