AITopics | Grammars & Parsing

Grammars & Parsing

A Unified Representation for Continuity and Discontinuity: Syntactic and Computational Motivations

arXiv.org Artificial Intelligence Jun-9-2025

The correspondence principle is proposed to enable a unified representation of the representational principles from Phrase Structure Grammar (PSG), Dependency Grammar (DG), and Categorial Grammar (CG). To that end, the paper first walks through a series of steps in achieving a unified representation for a discontinuous subordinate clause from Turkish. This affords a new way of approaching discontinuity in natural language from a theoretical point of view that unites and integrates the basic tenets of PSG, DG, and CG, with significant consequences for syntactic analysis. The paper then demonstrates that a unified representation can simplify computational complexity with regard to the neurocognitive representation and processing of both continuous and discontinuous sentences vis-à-vis the basic principles of PSG, DG, and CG.

1 Introduction

Discontinuity refers to a case of non-adjacency in which a predicate and its argument(s) are not adjacent in the linear order of the sentence; predicate structure here may apply to constituents such as verb phrases, noun phrases, adjective phrases, etc. It is typically observed in free word order languages, including Australian languages such as Warlpiri and Jiwarli, as well as Turkish (Hale, 1982, 1983; Nordlinger, 2014). Figure 1 depicts a schematic representation of continuity and discontinuity.

artificial intelligence, natural language, relation, (16 more...)

arXiv.org Artificial Intelligence

2506.05686

Country:

Europe (0.93)
Asia (0.67)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.28)

Genre: Research Report (0.40)

Industry: Health & Medicine (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Non-Asymptotic Length Generalization

Chen, Thomas, Ma, Tengyu, Li, Zhiyuan

arXiv.org Artificial Intelligence Jun-9-2025

Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of the ground-truth function under some given complexity measure. We refer to this minimum input length required to length generalize as length complexity. We show that the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$ where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds of length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.
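
As a rough illustration of the $2n - 2$ figure for Deterministic Finite Automata, the Python sketch below compares two hypothetical 2-state automata by brute force. It relies on the classical fact that two inequivalent DFAs with at most $n$ states each must disagree on some word of length at most $2n - 2$. The abstract does not say that this is how the paper argues its bound, and the automata, the tuple encoding, and every function name here are assumptions made for illustration.

```python
from itertools import product

def accepts(dfa, word):
    """Run a DFA given as (start, accepting_states, transitions) on a word."""
    start, accepting, delta = dfa
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting

def first_disagreement(dfa_a, dfa_b, alphabet, max_len):
    """Shortest word of length <= max_len on which the DFAs differ, else None."""
    for length in range(max_len + 1):
        for word in product(alphabet, repeat=length):
            if accepts(dfa_a, word) != accepts(dfa_b, word):
                return "".join(word)
    return None

# Two hypothetical 2-state DFAs over the one-letter alphabet {a}.
# ONLY_EMPTY accepts just the empty word; EVEN_LENGTH accepts words of even length.
ONLY_EMPTY = (0, {0}, {(0, "a"): 1, (1, "a"): 1})
EVEN_LENGTH = (0, {0}, {(0, "a"): 1, (1, "a"): 0})

n = 2  # number of states in each automaton
print(first_disagreement(ONLY_EMPTY, EVEN_LENGTH, "a", n - 1))      # too short to tell them apart
print(first_disagreement(ONLY_EMPTY, EVEN_LENGTH, "a", 2 * n - 2))  # long enough
```

In this example the two automata agree on every word of length 0 and 1 and first differ on a word of length 2, so for $n = 2$ inputs shorter than $2n - 2$ cannot separate them.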

length generalization, logic & formal reasoning, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2506.03085

Country: North America > United States (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.45)

Automated Journalistic Questions: A New Method for Extracting 5W1H in French

Verhaverbeke, Maxence, Gramaccia, Julie A., Khoury, Richard

arXiv.org Artificial Intelligence Jun-9-2025

The 5W1H questions -- who, what, when, where, why and how -- are commonly used in journalism to ensure that an article describes events clearly and systematically. Answering them is a crucial prerequisite for tasks such as summarization, clustering, and news aggregation. In this paper, we design the first automated extraction pipeline to get 5W1H information from French news articles. To evaluate the performance of our algorithm, we also create a corpus of 250 Quebec news articles with 5W1H answers marked by four human annotators. Our results demonstrate that our pipeline performs as well in this task as the large language model GPT-4o.

annotator, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2505.14804

Country: North America > Canada > Quebec (0.25)

Genre: Research Report > New Finding (0.68)

Industry: Media > News (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)

Towards a Unified System of Representation for Continuity and Discontinuity in Natural Language

Kandala, Ratna, Mondal, Prakash

arXiv.org Artificial Intelligence Jun-6-2025

Syntactic discontinuity is a grammatical phenomenon in which a constituent is split into more than one part because of the insertion of an element which is not part of the constituent. This is observed in many languages across the world, such as Turkish, Russian, Japanese, Warlpiri, Navajo, Hopi, Dyirbal, and Yidiny. Different formalisms/frameworks in current linguistic theory approach the problem of discontinuous structures in different ways. Each framework/formalism has been widely viewed as an independent and non-converging system of analysis. In this paper, we propose a unified system of representation for both continuity and discontinuity in structures of natural languages by taking into account three formalisms in particular: Phrase Structure Grammar (PSG) for its widely used notion of constituency, Dependency Grammar (DG) for its head-dependent relations, and Categorial Grammar (CG) for its focus on functor-argument relations. We attempt to show that discontinuous expressions as well as continuous structures can be analysed through a unified mathematical derivation incorporating the representations of linguistic structure in these three grammar formalisms.

artificial intelligence, natural language, relation, (18 more...)

arXiv.org Artificial Intelligence

2506.05235

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > Illinois > Cook County > Chicago (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(13 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey

Vegner, Ivan, de Souza, Sydelle, Forch, Valentin, Lewis, Martha, Doumas, Leonidas A. A.

arXiv.org Artificial Intelligence Jun-6-2025

A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while Fodor and Pylyshyn argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley's (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2506.04461

Country:

Europe (1.00)
North America > United States (0.68)
North America > Mexico (0.46)

Genre:

Overview (1.00)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(2 more...)

A conclusive remark on linguistic theorizing and language modeling

Chesi, Cristiano

arXiv.org Artificial Intelligence Jun-5-2025

Considering the proliferation of responses to Piantadosi's original paper and the ongoing debate sparked by this special issue of the Italian Journal of Linguistics, it is clear that the discussion has touched a raw nerve in linguistic theorizing. In the original target paper (Chesi, this issue), I illustrated three prototypical (and in many respects, extreme) positions -- the computational, theoretical, and experimental perspectives -- without explicitly endorsing any of them. Instead, I attempted to highlight what I believe are the key weaknesses of each of these prototypical stances, ultimately concluding that formal (i.e., 'generative') linguistics -- more specifically, Minimalism, my theoretical comfort zone -- must adopt practices and tools that are common in both computational and experimental fields. As noted by most respondents, the title and some of the more extreme statements were intended as mild provocations to draw attention to core issues affecting linguistic theorizing. My position -- somehow obscured behind the 'three-body problem' -- is that any relevant scientific progress is driven by theoretical insight, not by trawling using experimental or computational methods that are cost-inefficient, energy-intensive, and ultimately unsustainable. Moreover, in full agreement with most of the replies, I believe that the success of certain large language models (LLMs), which are based on specific architectural assumptions, does not constitute a refutation of the generative paradigm. On the contrary, it strongly supports several key intuitions that have emerged within the generative linguistic tradition (Rizzi, this issue). However, a concrete problem of 'incommensurability' arises (Hao, this issue), as differing methodologies and specialized jargon (Butt, this issue) often result in circular, unresolved discussions.

italian journal, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2506.03268

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
(7 more...)

Genre:

Research Report (0.40)
Personal > Opinion (0.34)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Riabi, Arij, Sagot, Benoît, Seddah, Djamé

arXiv.org Artificial Intelligence Jun-4-2025

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

computational linguistic, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2110.13658

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
North America > United States > California (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)

IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator

Sakai, Yusuke, Goto, Takumi, Watanabe, Taro

arXiv.org Artificial Intelligence Jun-4-2025

We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.

artificial intelligence, computational linguistic, natural language, (14 more...)

arXiv.org Artificial Intelligence

2506.02899

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)
Asia > Japan > Honshū (0.28)

Genre: Research Report > New Finding (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Structure-Aware Fill-in-the-Middle Pretraining for Code

Gong, Linyuan, Cheung, Alvin, Elhoushi, Mostafa, Wang, Sida

arXiv.org Artificial Intelligence Jun-3-2025

Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at https://github.com/gonglinyuan/ast_fim.
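
To make the contrast between masking a random character span and masking a complete syntactic structure concrete, here is a minimal Python sketch. It is not the authors' implementation: Python's built-in `ast` module stands in for whatever parser AST-FIM uses, the uniform choice among statement and expression nodes is an assumption, and `SAMPLE`, `candidate_nodes`, `split_at_node`, and `random_char_split` are hypothetical names.

```python
import ast
import random

SAMPLE = '''def area(w, h):
    if w < 0 or h < 0:
        raise ValueError("negative side")
    return w * h
'''

def candidate_nodes(source):
    """All statement and expression nodes; each is a complete syntactic unit."""
    tree = ast.parse(source)
    return [n for n in ast.walk(tree) if isinstance(n, (ast.stmt, ast.expr))]

def split_at_node(source, node):
    """Return (prefix, middle, suffix) where middle covers exactly one AST node."""
    data = source.encode("utf-8")  # ast column offsets count UTF-8 bytes
    starts = [0]                   # byte offset at which each line begins
    for line in data.splitlines(keepends=True):
        starts.append(starts[-1] + len(line))
    begin = starts[node.lineno - 1] + node.col_offset
    end = starts[node.end_lineno - 1] + node.end_col_offset
    return (data[:begin].decode("utf-8"),
            data[begin:end].decode("utf-8"),
            data[end:].decode("utf-8"))

def random_char_split(source, rng):
    """Baseline: mask a random character span, ignoring syntax."""
    i, j = sorted(rng.sample(range(len(source) + 1), 2))
    return source[:i], source[i:j], source[j:]

rng = random.Random(0)
node = rng.choice(candidate_nodes(SAMPLE))
prefix, middle, suffix = split_at_node(SAMPLE, node)
print("AST-aligned middle:", repr(middle))
print("Random-span middle:", repr(random_char_split(SAMPLE, rng)[1]))
```

In both cases the three pieces concatenate back to the original source. The difference is that the AST-aligned middle is always a whole expression or statement, while the random span can start or end inside a token.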

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.00204

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Comparing LLM-generated and human-authored news text using formal syntactic theory

Zamaraeva, Olga, Flickinger, Dan, Bond, Francis, Gómez-Rodríguez, Carlos

arXiv.org Artificial Intelligence Jun-3-2025

This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2506.01407

Country: North America > United States (0.68)

Genre:

Research Report > Experimental Study (0.88)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
