Goto

Collaborating Authors

 syntactic feature


Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features

arXiv.org Artificial Intelligence

This study examines the lexical and syntactic interventions of human and LLM proofreading aimed at improving overall intelligibility in identical second language writings, and evaluates the consistency of outcomes across three LLMs (ChatGPT-4o, Llama3.1-8b, Deepseek-r1-8b). Findings show that both human and LLM proofreading enhance bigram lexical features, which may contribute to better coherence and contextual connectedness between adjacent words. However, LLM proofreading exhibits a more generative approach, extensively reworking vocabulary and sentence structures, such as employing more diverse and sophisticated vocabulary and incorporating a greater number of adjective modifiers in noun phrases. The proofreading outcomes are highly consistent in major lexical and syntactic features across the three models.


Multilingual Irony Detection with Dependency Syntax and Neural Models

arXiv.org Artificial Intelligence

This paper presents an in-depth investigation of the effectiveness of dependency-based syntactic features on the irony detection task in a multilingual perspective (English, Spanish, French and Italian). It focuses on the contribution from syntactic knowledge, exploiting linguistic resources where syntax is annotated according to the Universal Dependencies scheme. Three distinct experimental settings are provided. In the first, a variety of syntactic dependency-based features combined with classical machine learning classifiers are explored. In the second scenario, two well-known types of word embeddings are trained on parsed data and tested against gold standard datasets. In the third setting, dependency-based syntactic features are combined into the Multilingual BERT architecture. The results suggest that fine-grained dependency-based syntactic information is informative for the detection of irony.


Do Large Language Models know who did what to whom?

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are commonly criticized for not "understanding" language. However, many critiques focus on cognitive abilities that, in humans, are distinct from language processing. Here, we instead study a kind of understanding tightly linked to language: inferring "who did what to whom" (thematic roles) in a sentence. Does the central training objective of LLMs--word prediction--result in sentence representations that capture thematic roles? In two experiments, we characterized sentence representations in four LLMs. In contrast to human similarity judgments, in LLMs the overall representational similarity of sentence pairs reflected syntactic similarity but not whether their agent and patient assignments were identical vs. reversed. Furthermore, we found little evidence that thematic role information was available in any subset of hidden units. However, some attention heads robustly captured thematic roles, independently of syntax. Therefore, LLMs can extract thematic roles but, relative to humans, this information influences their representations more weakly.


Analyzing the Inner Workings of Transformers in Compositional Generalization

arXiv.org Artificial Intelligence

The compositional generalization abilities of neural models have been sought after for human-like linguistic competence. The popular method to evaluate such abilities is to assess the models' input-output behavior. However, that does not reveal the internal mechanisms, and the underlying competence of such models in compositional generalization remains unclear. To address this problem, we explore the inner workings of a Transformer model by finding an existing subnetwork that contributes to the generalization performance and by performing causal analyses on how the model utilizes syntactic features. We find that the model depends on syntactic features to output the correct answer, but that the subnetwork with much better generalization performance than the whole model relies on a non-compositional algorithm in addition to the syntactic features. We also show that the subnetwork improves its generalization performance relatively slowly during the training compared to the in-distribution one, and the non-compositional solution is acquired in the early stages of the training.


Detection of LLM-Generated Java Code Using Discretized Nested Bigrams

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are currently used extensively to generate code by professionals and students, motivating the development of tools to detect LLM-generated code for applications such as academic integrity and cybersecurity. We address this authorship attribution problem as a binary classification task along with feature identification and extraction. We propose new Discretized Nested Bigram Frequency features on source code groups of various sizes. Compared to prior work, improvements are obtained by representing sparse information in dense membership bins. Experimental evaluation demonstrated that our approach significantly outperformed a commonly used GPT code-detection API and baseline features, with accuracy exceeding 96% compared to 72% and 79% respectively in detecting GPT-rewritten Java code fragments for 976 files with GPT 3.5 and GPT4 using 12 features. We also outperformed three prior works on code author identification in a 40-author dataset. Our approach scales well to larger data sets, and we achieved 99% accuracy and 0.999 AUC for 76,089 files and over 1,000 authors with GPT 4o using 227 features.


What is a word?

arXiv.org Artificial Intelligence

"Despite 2,400 years or so of trying, it is unclear that anyone has ever come up with an adequate definition of any word whatsoever, even the simplest." Surprisingly few linguists and philosophers have a clear model of what a word is, even though words impact basically every aspect of human life. Researchers that regularly publish academic papers about language often rely on outdated, or inaccurate, assumptions about wordhood. As in all scientific disciplines, we have two notions to consider: 1. Our intuitive concept of'word' (which we all have, even though it can be vague, and sometimes hard to articulate fully, like most complex concepts). This is no different from other scientific concepts - for example, 'water' has a very intuitive meaning, but it also is linked to much more technical, formal notions emerging from chemistry and physics (Murphy 2023). This short pedagogical document outlines what the lexicon is most certainly not (though is often mistakenly taken to be), what it might be (based on current good theories), and what some implications for experimental design are. The central features of lexical items have no connection with sensorimotor instructions.


Turkish Native Language Identification

arXiv.org Artificial Intelligence

In this paper, we present the first application of Native Language Identification (NLI) for the Turkish language. NLI involves predicting the writer's first language by analysing their writing in different languages. While most NLI research has focused on English, our study extends its scope to Turkish. We used the recently constructed Turkish Learner Corpus and employed a combination of three syntactic features (CFG production rules, part-of-speech n-grams, and function words) with L2 texts to demonstrate their effectiveness in this task.


Syntactic Structure Processing in the Brain while Listening

arXiv.org Artificial Intelligence

Syntactic parsing is the task of assigning a syntactic structure to a sentence. There are two popular syntactic parsing methods: constituency and dependency parsing. Recent works have used syntactic embeddings based on constituency trees, incremental top-down parsing, and other word syntactic features for brain activity prediction given the text stimuli to study how the syntax structure is represented in the brain's language network. However, the effectiveness of dependency parse trees or the relative predictive power of the various syntax parsers across brain areas, especially for the listening task, is yet unexplored. In this study, we investigate the predictive power of the brain encoding models in three settings: (i) individual performance of the constituency and dependency syntactic parsing based embedding methods, (ii) efficacy of these syntactic parsing based embedding methods when controlling for basic syntactic signals, (iii) relative effectiveness of each of the syntactic embedding methods when controlling for the other. Further, we explore the relative importance of syntactic information (from these syntactic embedding methods) versus semantic information using BERT embeddings. We find that constituency parsers help explain activations in the temporal lobe and middle-frontal gyrus, while dependency parsers better encode syntactic structure in the angular gyrus and posterior cingulate cortex. Although semantic signals from BERT are more effective compared to any of the syntactic features or embedding methods, syntactic embedding methods explain additional variance for a few brain regions.


Tapping the Potential of Coherence and Syntactic Features in Neural Models for Automatic Essay Scoring

arXiv.org Artificial Intelligence

In the prompt-specific holistic score prediction task for Automatic Essay Scoring, the general approaches include pre-trained neural model, coherence model, and hybrid model that incorporate syntactic features with neural model. In this paper, we propose a novel approach to extract and represent essay coherence features with prompt-learning NSP that shows to match the state-of-the-art AES coherence model, and achieves the best performance for long essays. We apply syntactic feature dense embedding to augment BERT-based model and achieve the best performance for hybrid methodology for AES. In addition, we explore various ideas to combine coherence, syntactic information and semantic embeddings, which no previous study has done before. Our combined model also performs better than the SOTA available for combined model, even though it does not outperform our syntactic enhanced neural model. We further offer analyses that can be useful for future study.


Syntactic structures and the general Markov models

arXiv.org Artificial Intelligence

The focus of the present paper is to investigate the following questions: to what extent syntactic features capture phylogenetic relationships and to what extent Markov models are a viable assumption for phylogenetic reconstruction based on syntactic features. For the second, we also consider an alternative that we argue approximates the infinite site evolutionary model. These questions are motivated by the fact that at both lexical and syntactic level, Markov processes are commonly assumed to underlie computational models of language change; for instance, within the Principles and Parameters setting relevant here, Niyogi and Berwick (1997) developed models of language acquisition and language change based on a Markov process in a space of syntactic parameters. In this paper we focus only on language change processes, viewed through the lens of phylogenetic trees of language families. While the model we consider are not directly related to models of language acquisition and parameter setting, the historical changes of syntax within and across language families, through the modification of syntactic parameters, can be seen as an effect of such underlying dynamics.