Grammars & Parsing
MRL Parsing Without Tears: The Case of Hebrew
Shmidman, Shaltiel, Shmidman, Avi, Koppel, Moshe, Tsarfaty, Reut
Syntactic parsing remains a critical tool for relation extraction and information extraction, especially in resource-scarce languages where LLMs are lacking. Yet in morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity. Some use a pipeline to peel away the layers: first segmentation, then morphology tagging, and then syntax parsing; however, errors in earlier layers are then propagated forward. Others use a joint architecture to evaluate all permutations at once; while this improves accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test case, we present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task. The classifiers are independent of one another, and only at the end do we synthesize their predictions. This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew NLP tasks. Because our architecture does not rely on any language-specific resources, it can serve as a model to develop similar parsers for other MRLs.
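The "flipped pipeline" idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the transliterated toy tokens, the lookup tables standing in for trained classifiers, and the function names are hypothetical and are not taken from the authors' system. The only point shown is the control flow: each expert predicts on whole tokens independently, and the predictions are combined only at the end.

```python
from typing import Dict, List

# Toy lookup tables standing in for trained expert classifiers.
# All entries are hypothetical, transliterated examples.
PREFIXES: Dict[str, List[str]] = {"hyld": ["h"], "lbyt": ["l"]}
POS_TAGS: Dict[str, str] = {"hyld": "NOUN", "hlk": "VERB", "lbyt": "NOUN"}
PREDICATES = {"hlk"}


def predict_prefixes(tokens: List[str]) -> List[List[str]]:
    """Expert 1: which prefix units are attached to each whole token."""
    return [PREFIXES.get(tok, []) for tok in tokens]


def predict_pos(tokens: List[str]) -> List[str]:
    """Expert 2: POS tag of the main word in each whole token."""
    return [POS_TAGS.get(tok, "X") for tok in tokens]


def predict_heads(tokens: List[str]) -> List[int]:
    """Expert 3: index of each token's syntactic head (-1 marks the root)."""
    root = next((i for i, tok in enumerate(tokens) if tok in PREDICATES), 0)
    return [-1 if i == root else root for i in range(len(tokens))]


def parse(tokens: List[str]) -> List[dict]:
    # Each expert runs on the whole tokens without seeing the others' output.
    prefixes = predict_prefixes(tokens)
    pos = predict_pos(tokens)
    heads = predict_heads(tokens)
    # Synthesis happens only here, after all predictions are made.
    return [
        {"token": tok, "prefixes": pre, "pos": tag, "head": head}
        for tok, pre, tag, head in zip(tokens, prefixes, pos, heads)
    ]


if __name__ == "__main__":
    for row in parse(["hyld", "hlk", "lbyt"]):
        print(row)
```

Because no expert consumes another expert's output in this sketch, an error in one prediction cannot propagate to the others, which is the contrast with the segmentation-first pipeline drawn in the abstract.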
SPAWNing Structural Priming Predictions from a Cognitively Motivated Parser
Structural priming is a widely used psycholinguistic paradigm to study human sentence representations. In this work we propose a framework for using empirical priming patterns to build a theory characterizing the structural representations humans construct when processing sentences. This framework uses a new cognitively motivated parser, SPAWN, to generate quantitative priming predictions from theoretical syntax and evaluate these predictions with empirical human behavior. As a case study, we apply this framework to study reduced relative clause representations in English. We use SPAWN to generate priming predictions from two theoretical accounts which make different assumptions about the structure of relative clauses. We find that the predictions from only one of these theories (Participial-Phase) align with empirical priming patterns, thus highlighting which assumptions about relative clauses better capture human sentence representations.
MaiBaam Annotation Guidelines
Blaschke, Verena, Kovačić, Barbara, Peng, Siyao, Plank, Barbara
This document provides annotation guidelines for MaiBaam, a Bavarian corpus annotated with part-of-speech (POS) tags and syntactic dependencies. MaiBaam belongs to the Universal Dependencies (UD) project (Zeman et al., 2023; de Marneffe et al., 2021), and our annotations elaborate on the general and German UD version 2 guidelines. This document is structured broadly in the order we prepare and annotate sentences: first, preprocessing and tokenization (§1), then general recaps of POS tags (§2) and dependencies (§3), before we go into annotation decisions that would also apply to German (§4) and lastly decisions that are specific to Bavarian grammar (§5). Many examples are written in German, since the standardized orthography makes it easier to search this PDF. We only annotate UD-style POS tags (UPOS tags) and dependencies and add the SpaceAfter=No feature where appropriate, but do not add any other information (no lemma, XPOS tags, morphological features, enhanced dependencies or miscellaneous annotations). This document is primarily directed at present and future annotators of MaiBaam. We publish it to additionally allow others working with MaiBaam or annotating similar data to better understand the decisions we have made.
Schema-Aware Multi-Task Learning for Complex Text-to-SQL
Conventional text-to-SQL parsers struggle to synthesize complex SQL queries that involve multiple tables or columns, because it is hard to identify the correct schema items and to align them accurately with the question. To address this issue, we present a schema-aware multi-task learning framework (named MTSQL) for complicated SQL queries. Specifically, we design a schema linking discriminator module to distinguish the valid question-schema linkings, which explicitly instructs the encoder by distinctive linking relations to enhance the alignment quality. On the decoder side, we define six types of relationships to describe the connections between tables and columns (e.g., WHERE_TC), and introduce an operator-centric triple extractor to recognize the schema items associated with each predefined relationship. We also establish a rule set of grammar constraints based on the predicted triples to filter the proper SQL operators and schema items during SQL generation. On Spider, a challenging cross-domain text-to-SQL benchmark, experimental results indicate that MTSQL is more effective than baselines, especially in extremely hard scenarios. Moreover, further analyses verify that our approach leads to promising improvements for complicated SQL queries.
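The idea of using predicted triples to constrain decoding can be sketched as a simple filter. This is only an illustration under assumptions: the triple layout (relation, table, column), the relation name SELECT_TC, the example schema, and the fallback behaviour when no triple is predicted are all hypothetical; only the relation name WHERE_TC appears in the abstract.

```python
from typing import List, Set, Tuple

# Hypothetical triple layout: (relation, table, column).
Triple = Tuple[str, str, str]
SchemaItem = Tuple[str, str]  # (table, column)


def allowed_items(triples: List[Triple], relation: str) -> Set[SchemaItem]:
    """Collect the schema items that the predicted triples license for a relation."""
    return {(table, column) for rel, table, column in triples if rel == relation}


def filter_candidates(
    candidates: List[SchemaItem], triples: List[Triple], relation: str
) -> List[SchemaItem]:
    """Keep only the candidates licensed by the triples for this relation.

    Assumption: if no triple was predicted for the relation, the decoder is
    left unconstrained and all candidates are returned.
    """
    allowed = allowed_items(triples, relation)
    if not allowed:
        return list(candidates)
    return [item for item in candidates if item in allowed]


if __name__ == "__main__":
    predicted = [("WHERE_TC", "singer", "age"), ("SELECT_TC", "singer", "name")]
    schema = [("singer", "age"), ("singer", "name"), ("concert", "year")]
    print(filter_candidates(schema, predicted, "WHERE_TC"))
```

In this toy setup, a decoder about to emit a WHERE condition would only consider the column licensed by the WHERE_TC triple.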
Computational Modelling of Plurality and Definiteness in Chinese Noun Phrases
Liu, Yuqi, Chen, Guanyi, van Deemter, Kees
Theoretical linguists have suggested that some languages (e.g., Chinese and Japanese) are "cooler" than other languages based on the observation that the intended meaning of phrases in these languages depends more on their contexts. As a result, many expressions in these languages are shortened, and their meaning is inferred from the context. In this paper, we focus on the omission of the plurality and definiteness markers in Chinese noun phrases (NPs) to investigate the predictability of their intended meaning given the contexts. To this end, we built a corpus of Chinese NPs, each of which is accompanied by its corresponding context, and by labels indicating its singularity/plurality and definiteness/indefiniteness. We carried out corpus assessments and analyses. The results suggest that Chinese speakers indeed drop plurality and definiteness markers very frequently. Building on the corpus, we train a bank of computational models using both classic machine learning models and state-of-the-art pre-trained language models to predict the plurality and definiteness of each NP. We report on the performance of these models and analyse their behaviours.
FCDS: Fusing Constituency and Dependency Syntax into Document-Level Relation Extraction
Zhu, Xudong, Kang, Zhao, Hui, Bei
Document-level Relation Extraction (DocRE) aims to identify relation labels between entities within a single document. It requires handling several sentences and reasoning over them. State-of-the-art DocRE methods use a graph structure to connect entities across the document to capture dependency syntax information. However, this is insufficient to fully exploit the rich syntax information in the document. In this work, we propose to fuse constituency and dependency syntax into DocRE. It uses constituency syntax to aggregate the whole sentence information and select the instructive sentences for the pairs of targets. It exploits the dependency syntax in a graph structure with constituency syntax enhancement and chooses the path between entity pairs based on the dependency graph. The experimental results on datasets from various domains demonstrate the effectiveness of the proposed method. The code is publicly available at https://github.com/xzAscC/FCDS.
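Choosing a path between an entity pair on a dependency graph can be illustrated with a breadth-first search. This is a minimal sketch, not the FCDS implementation: the example sentence, its edges, and the use of word strings as node identifiers (which assumes every word is unique) are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def shortest_dependency_path(
    edges: List[Tuple[str, str]], start: str, goal: str
) -> Optional[List[str]]:
    """Return the shortest undirected path between two nodes, or None."""
    # Treat head -> dependent arcs as undirected edges.
    graph: Dict[str, List[str]] = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    parent: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the path.
            path = []
            current: Optional[str] = node
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    # Hypothetical parse of "Marie Curie was born in Warsaw" as (head, dependent).
    arcs = [
        ("born", "Curie"),
        ("Curie", "Marie"),
        ("born", "was"),
        ("born", "Warsaw"),
        ("Warsaw", "in"),
    ]
    print(shortest_dependency_path(arcs, "Curie", "Warsaw"))
```

The words on the returned path are the ones a model would attend to when classifying the relation between the two entities in this toy setting.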
Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5
Wang, Qiao, Rose, Ralph, Orita, Naho, Sugawara, Ayaka
A common way of assessing language learners' mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLMs). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research which did not take advantage of GPT's capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.
ModelWriter: Text & Model-Synchronized Document Engineering Platform
Erata, Ferhat, Gardent, Claire, Gyawali, Bikash, Shimorina, Anastasia, Lussaud, Yvan, Tekinerdogan, Bedir, Kardas, Geylani, Monceaux, Anne
The ModelWriter platform provides a generic framework for automated traceability analysis. In this paper, we demonstrate how this framework can be used to trace the consistency and completeness of technical documents that consist of a set of System Installation Design Principles used by Airbus to ensure the correctness of aircraft system installation. In particular, we show how the platform allows the integration of two types of reasoning: reasoning about the meaning of text using semantic parsing and description logic theorem proving; and reasoning about document structure using first-order relational logic and finite model finding for traceability analysis.
A Compositional Typed Semantics for Universal Dependencies
Bradford, Laurestine, O'Donnell, Timothy John, Reddy, Siva
Languages may encode similar meanings using different sentence structures. This makes it a challenge to provide a single set of formal rules that can derive meanings from sentences in many languages at once. To overcome the challenge, we can take advantage of language-general connections between meaning and syntax, and build on cross-linguistically parallel syntactic structures. We introduce UD Type Calculus (UD-TC), a compositional, principled, and language-independent system of semantic types and logical forms for lexical items which builds on a widely used language-general dependency syntax framework. We explain the essential features of UD Type Calculus, which all involve giving dependency relations denotations just like those of words. These allow UD-TC to derive correct meanings for sentences with a wide range of syntactic structures by making use of dependency labels. Finally, we present evaluation results on a large existing corpus of sentences and their logical forms, showing that UD-TC can produce meanings comparable with our baseline.
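The notion that dependency relations carry denotations, just as words do, can be illustrated with a toy composition. This sketch is not the UD Type Calculus itself: the event-style logical forms, the string representation, and the helper name relation are assumptions chosen only to show a relation label acting as a function that combines a head meaning with a dependent meaning.

```python
from typing import Callable

# A verb meaning is modelled as a function from an event variable to a
# formula string; a noun meaning is modelled as a constant name.
EventPredicate = Callable[[str], str]


def verb(name: str) -> EventPredicate:
    """Hypothetical verb denotation: lambda e. name(e)."""
    return lambda event: f"{name}({event})"


def relation(label: str) -> Callable[[EventPredicate], Callable[[str], EventPredicate]]:
    """Hypothetical relation denotation:
    lambda head. lambda dep. lambda e. head(e) & label(e, dep)."""
    def apply_to_head(head: EventPredicate) -> Callable[[str], EventPredicate]:
        def apply_to_dependent(dependent: str) -> EventPredicate:
            return lambda event: f"{head(event)} & {label}({event}, {dependent})"
        return apply_to_dependent
    return apply_to_head


if __name__ == "__main__":
    # "dog chases cat": chase is the head, with nsubj and obj dependents.
    chase = verb("chase")
    with_subject = relation("nsubj")(chase)("dog")
    with_object = relation("obj")(with_subject)("cat")
    print(with_object("e"))
```

Because the relation label, not the word order, decides how the dependent contributes to the meaning, the same composition would apply to a language that orders subject, verb, and object differently.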
Ensemble-Based Unsupervised Discontinuous Constituency Parsing by Tree Averaging
Shayegh, Behzad, Wen, Yuqiao, Mou, Lili
We address unsupervised discontinuous constituency parsing, where we observe a high variance in the performance of the only previous model. We propose to build an ensemble of different runs of the existing discontinuous parser by averaging the predicted trees, to stabilize and boost performance. To begin with, we provide comprehensive computational complexity analysis (in terms of P and NP-complete) for tree averaging under different setups of binarity and continuity. We then develop an efficient exact algorithm to tackle the task, which runs in a reasonable time for all samples in our experiments. Results on three datasets show our method outperforms all baselines in all metrics; we also provide in-depth analyses of our approach.
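Tree averaging can be illustrated in a simplified setting. The sketch below handles only continuous binary trees, represented as sets of spans, and uses a CYK-style dynamic program to find the binary tree sharing the most spans with the ensemble; it is not the paper's algorithm for the discontinuous case. For binary trees over the same sentence every tree has the same number of spans, so maximizing the number of shared spans also maximizes the average span F1. The representation and function name are assumptions.

```python
from collections import Counter
from functools import lru_cache
from typing import FrozenSet, List, Set, Tuple

Span = Tuple[int, int]  # half-open interval [i, j) over word positions


def average_tree(trees: List[Set[Span]], n: int) -> Set[Span]:
    """Return the continuous binary tree over n words with maximal span overlap."""
    # How many ensemble members contain each span.
    counts = Counter(span for tree in trees for span in tree)

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> Tuple[int, FrozenSet[Span]]:
        score = counts[(i, j)]
        if j - i == 1:
            return score, frozenset({(i, j)})
        best_score, best_spans = -1, frozenset()
        # Try every split point and keep the highest-scoring pair of subtrees.
        for k in range(i + 1, j):
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            if left_score + right_score > best_score:
                best_score = left_score + right_score
                best_spans = left_spans | right_spans
        return score + best_score, best_spans | {(i, j)}

    return set(best(0, n)[1])


if __name__ == "__main__":
    left_branching = {(0, 1), (1, 2), (2, 3), (0, 2), (0, 3)}
    right_branching = {(0, 1), (1, 2), (2, 3), (1, 3), (0, 3)}
    ensemble = [left_branching, left_branching, right_branching]
    print(sorted(average_tree(ensemble, 3)))
```

With two left-branching runs and one right-branching run, the averaged tree is the left-branching one, which shows how the ensemble smooths out a single divergent run.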