Grammars & Parsing
Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices
Ventirozos, Filippos, Clinch, Sarah, Batista-Navarro, Riza
Semantic parsing of user-generated instructional text, in the way of enabling end-users to program the Internet of Things (IoT), is an underexplored area. In this study, we provide a unique annotated corpus which aims to support the transformation of cooking recipe instructions to machine-understandable commands for IoT devices in the kitchen. Each of these commands is a tuple capturing the semantics of an instruction involving a kitchen device in terms of "What", "Where", "Why" and "How". Based on this corpus, we developed machine learning-based sequence labelling methods, namely conditional random fields (CRF) and a neural network model, in order to parse recipe instructions and extract our tuples of interest from them. Our results show that while it is feasible to train semantic parsers based on our annotations, most natural-language instructions are incomplete, and thus transforming them into formal meaning representation, is not straightforward.
Exploring the Landscape of Natural Language Processing Research
Schopf, Tim, Arabi, Karim, Matthes, Florian
As an efficient approach to understand, generate, and process natural language texts, research in natural language processing (NLP) has exhibited a rapid spread and wide adoption in recent years. Given the increasing research work in this area, several NLP-related approaches have been surveyed in the research community. However, a comprehensive study that categorizes established topics, identifies trends, and outlines areas for future research remains absent. Contributing to closing this gap, we have systematically classified and analyzed research papers in the ACL Anthology. As a result, we present a structured overview of the research landscape, provide a taxonomy of fields of study in NLP, analyze recent developments in NLP, summarize our findings, and highlight directions for future work.
Spanish Resource Grammar version 2023
Zamaraeva, Olga, Gómez-Rodríguez, Carlos
We present the latest version of the Spanish Resource Grammar (SRG). The new SRG uses the recent version of Freeling morphological analyzer and tagger and is accompanied by a manually verified treebank and a list of documented issues. We also present the grammar's coverage and overgeneration on a small portion of a learner corpus, an entirely new research line with respect to the SRG. The grammar can be used for linguistic research, such as for empirically driven development of syntactic theory, and in natural language processing applications such as computer-assisted language learning. Finally, as the treebanks grow, they can be used for training high-quality semantic parsers and other systems which may benefit from precise and detailed semantics.
Natural Language Processing for Requirements Formalization: How to Derive New Approaches?
Sudhi, Viju, Kutty, Libin, Gröpler, Robin
It is a long-standing desire of industry and research to automate the software development and testing process as much as possible. In this process, requirements engineering (RE) plays a fundamental role for all other steps that build on it. Model-based design and testing methods have been developed to handle the growing complexity and variability of software systems. However, major effort is still required to create specification models from a large set of functional requirements provided in natural language. Numerous approaches based on natural language processing (NLP) have been proposed in the literature to generate requirements models using mainly syntactic properties. Recent advances in NLP show that semantic quantities can also be identified and used to provide better assistance in the requirements formalization process. In this work, we present and discuss principal ideas and state-of-the-art methodologies from the field of NLP in order to guide the readers on how to create a set of rules and methods for the semi-automated formalization of requirements according to their specific use case and needs. We discuss two different approaches in detail and highlight the iterative development of rule sets. The requirements models are represented in a human- and machine-readable format in the form of pseudocode. The presented methods are demonstrated on two industrial use cases from the automotive and railway domains. It shows that using current pre-trained NLP models requires less effort to create a set of rules and can be easily adapted to specific use cases and domains. In addition, findings and shortcomings of this research area are highlighted and an outlook on possible future developments is given.
JCoLA: Japanese Corpus of Linguistic Acceptability
Someya, Taiga, Sugimoto, Yushi, Oseki, Yohei
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
Is It Really Useful to Jointly Parse Constituency and Dependency Trees? A Revisit
Gu, Yanggang, Hou, Yang, Wang, Zhefeng, Duan, Xinyu, Li, Zhenghua
This work visits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. Compared with previous works, we make progress in four aspects: (1) adopting a much more efficient decoding algorithm, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components for constituent-dependency interaction, (4) gaining more insights via in-depth experiments and analysis.
Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System
While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar. This means that the grammar, a network which connects thousands of structures at different levels of abstraction, is reduced to a few disconnected variables. This paper quantifies the impact of such reductions by systematically modelling dialectal variation across 49 local populations of English speakers in 16 countries. We perform dialect classification with both an entire grammar as well as with isolated nodes within the grammar in order to characterize the syntactic differences between these dialects. The results show, first, that many individual nodes within the grammar are subject to variation but, in isolation, none perform as well as the grammar as a whole. This indicates that an important part of syntactic variation consists of interactions between different parts of the grammar. Second, the results show that the similarity between dialects depends heavily on the sub-set of the grammar being observed: for example, New Zealand English could be more similar to Australian English in phrasal verbs but at the same time more similar to UK English in dative phrases.
GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
Kara, Atakan, Sofian, Farrin Marouf, Bond, Andrew, Şahin, Gözde Gül
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners. Developing such tools requires a large amount of parallel, annotated data, which is unavailable for most languages. Synthetic data generation is a common practice to overcome the scarcity of such data. However, it is not straightforward for morphologically rich languages like Turkish due to complex writing rules that require phonological, morphological, and syntactic information. In this work, we present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules (a.k.a., writing rules) implemented through complex transformation functions. Using this pipeline, we derive 130,000 high-quality parallel sentences from professionally edited articles. Additionally, we create a more realistic test set by manually annotating a set of movie reviews. We implement three baselines formulating the task as i) neural machine translation, ii) sequence tagging, and iii) prefix tuning with a pretrained decoder-only model, achieving strong results. Furthermore, we perform exhaustive experiments on out-of-domain datasets to gain insights on the transferability and robustness of the proposed approaches. Our results suggest that our corpus, GECTurk, is high-quality and allows knowledge transfer for the out-of-domain setting. To encourage further research on Turkish GEC, we release our datasets, baseline models, and the synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
Assessment of Pre-Trained Models Across Languages and Grammars
Muñoz-Ortiz, Alberto, Vilares, David, Gómez-Rodríguez, Carlos
We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.