Grammars & Parsing
Survey of Abstract Meaning Representation: Then, Now, Future
This paper presents a survey of Abstract Meaning Representation (AMR), a semantic representation framework that captures the meaning of sentences through a graph-based structure. AMR represents sentences as rooted, directed acyclic graphs, where nodes correspond to concepts and edges denote relationships, effectively encoding the meaning of complex sentences. This survey investigates AMR and its extensions, focusing on AMR capabilities. It then explores the parsing (text-to-AMR) and generation (AMR-to-text) tasks by showing traditional, current, and possible futures approaches. It also reviews various applications of AMR including text generation, text classification, and information extraction and information seeking. By analyzing recent developments and challenges in the field, this survey provides insights into future directions for research and the potential impact of AMR on enhancing machine understanding of human language.
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Siebenschuh, Carlo, Hippe, Kyle, Gokdemir, Ozan, Brace, Alexander, Khan, Arham, Hossain, Khalid, Babuji, Yadu, Chia, Nicholas, Vishwanath, Venkatram, Stevens, Rick, Ramanathan, Arvind, Foster, Ian, Underwood, Robert
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by $17\times$ while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
Computational Identification of Regulatory Statements in EU Legislation
Brandsma, Gijs Jan, Blom-Hansen, Jens, Meijer, Christiaan, Moodley, Kody
Identifying regulatory statements in legislation is useful for developing metrics to measure the regulatory density and strictness of legislation. A computational method is valuable for scaling the identification of such statements from a growing body of EU legislation, constituting approximately 180,000 published legal acts between 1952 and 2023. Past work on extraction of these statements varies in the permissiveness of their definitions for what constitutes a regulatory statement. In this work, we provide a specific definition for our purposes based on the institutional grammar tool. We develop and compare two contrasting approaches for automatically identifying such statements in EU legislation, one based on dependency parsing, and the other on a transformer-based machine learning model. We found both approaches performed similarly well with accuracies of 80% and 84% respectively and a K alpha of 0.58. The high accuracies and not exceedingly high agreement suggests potential for combining strengths of both approaches.
Enhancing Health Mention Classification Performance: A Study on Advancements in Parameter Efficient Tuning
Abdel-Salam, Reem, Adewunmi, Mary
Health Mention Classification (HMC) plays a critical role in leveraging social media posts for real-time tracking and public health monitoring. Nevertheless, the process of HMC presents significant challenges due to its intricate nature, primarily stemming from the contextual aspects of health mentions, such as figurative language and descriptive terminology, rather than explicitly reflecting a personal ailment. To address this problem, we argue that clearer mentions can be achieved through conventional fine-tuning with enhanced parameters of biomedical natural language methods (NLP). In this study, we explore different techniques such as the utilisation of part-of-speech (POS) tagger information, improving on PEFT techniques, and different combinations thereof. Extensive experiments are conducted on three widely used datasets: RHDM, PHM, and Illness. The results incorporated POS tagger information, and leveraging PEFT techniques significantly improves performance in terms of F1-score compared to state-of-the-art methods across all three datasets by utilising smaller models and efficient training. Furthermore, the findings highlight the effectiveness of incorporating POS tagger information and leveraging PEFT techniques for HMC. In conclusion, the proposed methodology presents a potentially effective approach to accurately classifying health mentions in social media posts while optimising the model size and training efficiency.
A Comprehensive Part-of-Speech Tagging to Standardize Central-Kurdish Language: A Research Guide for Kurdish Natural Language Processing Tasks
Sabr, Shadan Shukr, Mustafa, Nazira Sabr, Omar, Talar Sabah, Rasool, Salah Hwayyiz, Omer, Nawzad Anwer, Hamad, Darya Sabir, Shams, Hemin Abdulhameed, Kareem, Omer Mahmood, Abdullah, Rozhan Noori, Abdullah, Khabat Atar, Mohammad, Mahabad Azad, Al-Raghefy, Haneen, Asaad, Safar M., Mohammed, Sara Jamal, Ali, Twana Saeed, Shawrow, Fazil, Maghdid, Halgurd S.
- The field of natural language processing (NLP) has dramatically expanded within the last decade. Many human-being applications are conducted daily via NLP tasks, starting from machine translation, speech recognition, text generation and recommendations, Part-of-Speech tagging (POS), and Named-Entity Recognition (NER). However, low-resourced languages, such as the Central-Kurdish language (CKL), mainly remain unexamined due to shortage of necessary resources to support their development. The POS tagging task is the base of other NLP tasks; for example, the POS tag set has been used to standardized languages to provide the relationship between words among the sentences, followed by machine translation and text recommendation. Specifically, for the CKL, most of the utilized or provided POS tagsets are neither standardized nor comprehensive. To this end, this study presented an accurate and comprehensive POS tagset for the CKL to provide better performance of the Kurdish NLP tasks. The article also collected most of the POS tags from different studies as well as from Kurdish linguistic experts to standardized part-of-speech tags. The proposed POS tagset is designed to annotate a large CKL corpus and support Kurdish NLP tasks. The initial investigations of this study via comparison with the Universal Dependencies framework for standard languages, show that the proposed POS tagset can streamline or correct sentences more accurately for Kurdish NLP tasks.
SCALAR: A Part-of-speech Tagger for Identifiers
Newman, Christian D., Scholten, Brandon, Testa, Sophia, Behler, Joshua A. C., Banabilah, Syreen, Collard, Michael L., Decker, Michael J., Mkaouer, Mohamed Wiem, Zampieri, Marcos, AlOmar, Eman Abdullah, Alsuhaibani, Reem, Peruma, Anthony, Maletic, Jonathan I.
--The paper presents the Source Code Analysis and Lexical Annotation Runtime (SCALAR), a tool specialized for mapping (annotating) source code identifier names to their corresponding part-of-speech tag sequence (grammar pattern). SCALAR's internal model is trained using scikit-learn's GradientBoostingClassifier in conjunction with a manually-curated oracle of identifier names and their grammar patterns. This specializes the tagger to recognize the unique structure of the natural language used by developers to create all types of identifiers (e.g., function names, variable names etc.). SCALAR's output is compared with a previous version of the tagger, as well as a modern off-the-shelf part-of-speech tagger to show how it improves upon other taggers' output for annotating identifiers. The code is available on Github 1 Index T erms --Program comprehension, identifier naming, part-of-speech tagging, natural language processing, software maintenance, software evolution I. I NTRODUCTION The identifiers developers create represent a significant amount of the information other developers must use to understand related code. Given that identifiers represent, on average, 70% of the characters in a code base [1], and developers spend more time reading code than writing [2], [3], it is important for researchers to better understand of how identifiers convey information, and how they can be improved to increase developer reading efficiency.
Self-Correction Makes LLMs Better Parsers
Zhang, Ziyan, Hou, Yang, Gong, Chen, Li, Zhenghua
Large language models (LLMs) have achieved remarkable success across various natural language processing (NLP) tasks. However, recent studies suggest that they still face challenges in performing fundamental NLP tasks essential for deep language understanding, particularly syntactic parsing. In this paper, we conduct an in-depth analysis of LLM parsing capabilities, delving into the specific shortcomings of their parsing results. We find that LLMs may stem from limitations to fully leverage grammar rules in existing treebanks, which restricts their capability to generate valid syntactic structures. To help LLMs acquire knowledge without additional training, we propose a self-correction method that leverages grammar rules from existing treebanks to guide LLMs in correcting previous errors. Specifically, we automatically detect potential errors and dynamically search for relevant rules, offering hints and examples to guide LLMs in making corrections themselves. Experimental results on three datasets with various LLMs, demonstrate that our method significantly improves performance in both in-domain and cross-domain settings on the English and Chinese datasets.
Persian Abstract Meaning Representation: Annotation Guidelines and Gold Standard Dataset
Takhshid, Reza, Azin, Tara, Shojaei, Razieh, Bahrani, Mohammad
This paper introduces the Persian Abstract Meaning Representation (AMR) guidelines, a detailed guide for annotating Persian sentences with AMR, focusing on the necessary adaptations to fit Persian's unique syntactic structures. We discuss the development process of a Persian AMR gold standard dataset consisting of 1,562 sentences created following the guidelines. By examining the language specifications and nuances that distinguish AMR annotations of a low-resource language like Persian, we shed light on the challenges and limitations of developing a universal meaning representation framework. The guidelines and the dataset introduced in this study highlight such challenges, aiming to advance the field.
A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations
Ren, Zeng, Guan, Xinyi, Rohrmeier, Martin
Patterns are fundamental to human cognition, enabling the recognition of structure and regularity across diverse domains. In this work, we focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data, and develop a candidate computational model of how humans detect and understand such structural repeats. Based on a weighted deduction system, our model infers the minimal generative process of a given sequence in the form of a Template program, a formalism that enriches the context-free grammar with repetition combinators. Such representation efficiently encodes the repetition of sub-computations in a recursive manner. As a proof of concept, we demonstrate the expressiveness of our model on short sequences from music and action planning. The proposed model offers broader insights into the mental representations and cognitive mechanisms underlying human pattern recognition.
Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English
Kamerath, Michael, De Santo, Aniello
This paper leverages past sentence processing studies to investigate whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. Furthermore, we test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies interestingly across models, but also general failings of these models in correctly capturing human-like preferences. In light of these results, we argue that RC attachment is the ideal benchmark for cross-linguistic investigations of LLMs' linguistic knowledge and biases.