Grammars & Parsing


Acquisition of Recursive Possessives and Recursive Locatives in Mandarin

arXiv.org Artificial Intelligence

Language is the cornerstone of human communication, and the complexity of language lies in the diversity and recursion of its structure. Chomsky (1957) introduced the concept of recursion into natural language, arguing that the grammar in human natural language was a finite set of recursive rules by which an infinite number of linguistic expressions could be generated. In Corballis' (2014) words, the claim that recursion is the essence of natural language has been a continuing theme of Chomsky's work since his 1957 book Syntactic Structures. This theme is reiterated in Hauser et al. (2002), who propose that the faculty of language in the narrow sense only includes recursion, the only uniquely human component of the faculty of language. This proposal is summarized as the "recursion-only hypothesis" in Jackendoff and Pinker (2005: 212), which highlights the importance of recursion in linguistics. In spite of the lack of a consistent definition of (linguistic) recursion in the literature, most of the literature involves category recursion, which is defined as the "embedding of a category inside another of the same category". For instance, Martins and Fitch (2014) claim that recursion has been used to characterize the process of embedding a constituent of a certain kind of category inside another constituent of the same kind. This "embedding" process naturally generates hierarchical structures that display similar properties across different levels of embedding, and, thus, the feature of "self-similarity" is a signature of recursive structures. To illustrate that, they hold that the compound noun [[student] committee] (which has the structure [[A] A]) is recursive since a noun phrase (NP) is embedded inside another NP, while a sentence with a noun plus a verb such as [[trees] grow] (which has the structure [[A] B]) is non-recursive since a constituent of a given type of category is not embedded within a constituent of that same type.
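
The definition of category recursion quoted above can be made concrete with a small sketch. It is only an illustration, not a method from the paper: the `Node` class, the `is_category_recursive` function, and the category labels "NP" and "S" assigned to the two examples are assumptions chosen here to mirror the bracketed structures in the abstract.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Node:
    """A constituent with a category label and embedded constituents (hypothetical type)."""
    category: str
    text: str
    children: List["Node"] = field(default_factory=list)


def is_category_recursive(node: Node, ancestors: FrozenSet[str] = frozenset()) -> bool:
    """Return True if some constituent is embedded inside a constituent of the same category."""
    if node.category in ancestors:
        return True
    enclosing = ancestors | {node.category}
    return any(is_category_recursive(child, enclosing) for child in node.children)


# [[student] committee]: an NP embedded inside another NP (assumed labels).
student_committee = Node("NP", "student committee", [Node("NP", "student")])

# [[trees] grow]: an NP embedded inside a sentence, a different category (assumed labels).
trees_grow = Node("S", "trees grow", [Node("NP", "trees")])

print(is_category_recursive(student_committee))  # True
print(is_category_recursive(trees_grow))         # False
```

The check looks at all ancestors rather than only the direct parent, so deeper embeddings of the same category also count as recursive under this reading of the definition.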


Semantic Role Labeling of NomBank Partitives

arXiv.org Artificial Intelligence

This article is about Semantic Role Labeling for English partitive nouns (5%/REL of the price/ARG1; The price/ARG1 rose 5 percent/REL) in the NomBank annotated corpus. Several systems are described using traditional and transformer-based machine learning, as well as ensembling. Our highest scoring system achieves an F1 of 91.74% using "gold" parses from the Penn Treebank and 91.12% when using the Berkeley Neural parser. This research includes both classroom and experimental settings for system development.
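
For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall over predicted role labels. The sketch below is a generic illustration, not the paper's scorer: the function name `f1_over_labels`, the token offsets, and the predicted labels are hypothetical, and only the REL/ARG1 annotation of "5% of the price" comes from the abstract.

```python
from typing import Set, Tuple

# A labeled span: (role, (start_token, end_token)), end exclusive (assumed representation).
LabeledSpan = Tuple[str, Tuple[int, int]]


def f1_over_labels(gold: Set[LabeledSpan], predicted: Set[LabeledSpan]) -> float:
    """Harmonic mean of precision and recall, counting exact (role, span) matches."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Tokens (assumed tokenization): 0 "5%", 1 "of", 2 "the", 3 "price"
gold = {("REL", (0, 1)), ("ARG1", (2, 4))}

# Hypothetical system output that misses the determiner in ARG1.
predicted = {("REL", (0, 1)), ("ARG1", (3, 4))}

print(f1_over_labels(gold, predicted))  # 0.5
```

Here one of two predictions is correct and one of two gold labels is found, so precision, recall, and F1 are all 0.5.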


A Fusion Approach of Dependency Syntax and Sentiment Polarity for Feature Label Extraction in Commodity Reviews

arXiv.org Artificial Intelligence

This study analyzes 13,218 product reviews from JD.com, covering four categories: mobile phones, computers, cosmetics, and food. A novel method for feature label extraction is proposed by integrating dependency parsing and sentiment polarity analysis. The proposed method addresses the challenges of low robustness in existing extraction algorithms and significantly enhances extraction accuracy. Experimental results show that the method achieves an accuracy of 0.7, with recall and F-score both stabilizing at 0.8, demonstrating its effectiveness. However, challenges such as dependence on matching dictionaries and the limited scope of extracted feature tags require further investigation in future research.


Underutilization of Syntactic Processing by Chinese Learners of English in Comprehending English Sentences, Evidenced from Adapted Garden-Path Ambiguity Experiment

arXiv.org Artificial Intelligence

Many studies have revealed that sentence comprehension relies more on semantic processing than on syntactic processing. However, previous studies have predominantly emphasized the preference for semantic processing, focusing on the semantic perspective. In contrast, the current study highlights the under-utilization of syntactic processing, from a syntactic perspective. Building on the traditional garden-path experiment, which involves locally ambiguous but globally unambiguous sentences, this study designed an adapted version featuring semantically ambiguous but syntactically unambiguous sentences to meet its specific research objective. This experiment, involving 140 subjects, demonstrates through descriptive and inferential statistical analyses using SPSS, GraphPad Prism, and Cursor that Chinese learners of English tend to under-utilize syntactic processing when comprehending English sentences. The study identifies two types of parsing under-utilization: partial and complete. Further exploration reveals that trial and error in syntactic processing contributes to both. Consequently, this study lays a foundation for the development of a novel parsing method designed to fully integrate syntactic processing into sentence comprehension, thereby enhancing the level of English sentence comprehension for Chinese learners of English.


Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)

arXiv.org Artificial Intelligence

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.


Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT

arXiv.org Artificial Intelligence

This study investigates the internal representations of verb-particle combinations within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic nuances at different neural network layers. Employing the BERT architecture, we analyse the representational efficacy of its layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Our methodology includes a detailed dataset preparation from the British National Corpus, followed by extensive model training and output analysis through techniques like multi-dimensional scaling (MDS) and generalized discrimination value (GDV) calculations. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories. These findings challenge the conventional uniformity assumed in neural network processing of linguistic elements and suggest a complex interplay between network architecture and linguistic representation. Our research contributes to a better understanding of how deep learning models comprehend and process language, offering insights into the potential and limitations of current neural approaches to linguistic analysis. This study not only advances our knowledge in computational linguistics but also prompts further research into optimizing neural architectures for enhanced linguistic precision.


Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization

arXiv.org Artificial Intelligence

Language models (LMs), like other neural networks, often favor shortcut heuristics based on surface-level patterns. Although LMs behave like n-gram models early in training, they must eventually learn hierarchical syntactic representations to correctly apply grammatical rules out-of-distribution (OOD). In this work, we use case studies of English grammar to explore how complex, diverse training data drives models to generalize OOD. We construct a framework that unifies our understanding of random variation with training dynamics, rule selection with memorization, and data diversity with complexity. We show that these factors are nuanced, and that intermediate levels of diversity and complexity lead to inconsistent behavior across random seeds and to unstable training dynamics. Our findings emphasize the critical role of training data in shaping generalization patterns and illuminate how competing model strategies lead to inconsistent generalization outcomes across random seeds.


Digestion Algorithm in Hierarchical Symbolic Forests: A Fast Text Normalization Algorithm and Semantic Parsing Framework for Specific Scenarios and Lightweight Deployment

arXiv.org Artificial Intelligence

Text Normalization and Semantic Parsing have numerous applications in natural language processing, such as natural language programming, paraphrasing, data augmentation, constructing expert systems, text matching, and more. Despite the prominent achievements of deep learning in Large Language Models (LLMs), the interpretability of neural network architectures is still poor, which affects their credibility and hence limits their deployment in risk-sensitive scenarios. In certain scenario-specific domains with scarce data, rapidly obtaining a large number of supervised learning labels is challenging, and the workload of manually labeling data would be enormous. Catastrophic forgetting in neural networks further leads to low data utilization rates. In situations where swift responses are vital, the density of the model makes local deployment difficult and the response time long, which is not conducive to local applications in these fields. Inspired by the multiplication rule, a principle of combinatorial mathematics, and by human thinking patterns, a multilayer framework along with its algorithm, the Digestion Algorithm in Hierarchical Symbolic Forests (DAHSF), is proposed to address these issues, combining text normalization and semantic parsing workflows. The Chinese Scripting Language "Fire Bunny Intelligent Development Platform V2.0" is an important test and application of the technology discussed in this paper. DAHSF can run locally in scenario-specific domains on small datasets, with model size and memory usage optimized by at least two orders of magnitude, thus improving execution speed and leaving room for further optimization.


Compositional Generalization Across Distributional Shifts with Sparse Tree Operations

arXiv.org Artificial Intelligence

Neural networks continue to struggle with compositional generalization, and this issue is exacerbated by a lack of massive pre-training. One successful approach for developing neural systems which exhibit human-like compositional generalization is hybrid neurosymbolic techniques. However, these techniques run into the core issues that plague symbolic approaches to AI: scalability and flexibility. The reason for this failure is that at their core, hybrid neurosymbolic models perform symbolic computation and relegate the scalable and flexible neural computation to parameterizing a symbolic system. We investigate a unified neurosymbolic system where transformations in the network can be interpreted simultaneously as both symbolic and neural computation. We extend a unified neurosymbolic architecture called the Differentiable Tree Machine in two central ways. First, we significantly increase the model's efficiency through the use of sparse vector representations of symbolic structures. Second, we enable its application beyond the restricted set of tree2tree problems to the more general class of seq2seq problems. The improved model retains its prior generalization capabilities and, since there is a fully neural path through the network, avoids the pitfalls of other neurosymbolic techniques that elevate symbolic computation over neural computation.


Relational Programming with Foundation Models

arXiv.org Artificial Intelligence

Foundation models have vast potential to enable diverse AI applications. The powerful yet incomplete nature of these models has spurred a wide range of mechanisms to augment them with capabilities such as in-context learning, information retrieval, and code interpreting. We propose Vieira, a declarative framework that unifies these mechanisms in a general solution for programming with foundation models. Vieira follows a probabilistic relational paradigm and treats foundation models as stateless functions with relational inputs and outputs. It supports neuro-symbolic applications by enabling the seamless combination of such models with logic programs, as well as complex, multi-modal applications by streamlining the composition of diverse sub-models. We implement Vieira by extending the Scallop compiler with a foreign interface that supports foundation models as plugins. We implement plugins for 12 foundation models including GPT, CLIP, and SAM. We evaluate Vieira on 9 challenging tasks that span language, vision, and structured and vector databases. Our evaluation shows that programs in Vieira are concise, can incorporate modern foundation models, and have comparable or better accuracy than competitive baselines.