Grammars & Parsing
Strongly Incremental Constituency Parsing with Graph Neural Networks
Parsing sentences into syntax trees can benefit downstream applications in NLP. Transition-based parsers build trees by executing actions in a state transition system. They are computationally efficient, and can leverage machine learning to predict actions based on partial trees. However, existing transition-based parsers are predominantly based on the shift-reduce transition system, which does not align with how humans are known to parse sentences. Psycholinguistic research suggests that human parsing is strongly incremental--humans grow a single parse tree by adding exactly one token at each step.
Scaling Graph-Based Dependency Parsing with Arc Vectorization and Attention-Based Refinement
Floquet, Nicolas, Roux, Joseph Le, Tomeh, Nadi, Charnois, Thierry
We propose a novel architecture for graph-based dependency parsing that explicitly constructs vectors, from which both arcs and labels are scored. Our method addresses key limitations of the standard two-pipeline approach by unifying arc scoring and labeling into a single network, reducing scalability issues caused by the information bottleneck and lack of parameter sharing. Additionally, our architecture overcomes limited arc interactions with transformer layers to efficiently simulate higher-order dependencies. Experiments on PTB and UD show that our model outperforms state-of-the-art parsers in both accuracy and efficiency.
Algorithm for Semantic Network Generation from Texts of Low Resource Languages Such as Kiswahili
Wanjawa, Barack Wamkaya, Muchemi, Lawrence, Miriti, Evans
Box 30197 Nairobi 00100, Kenya eamiriti@uonbi.ac.ke Abstract Processing low-resource languages, such as Kiswahili, using machine learning is difficult due to lack of adequate training data. However, such low-resource languages are still important for human communication and are already in daily use and users need practical machine processing tasks such as summarization, disambiguation and even question answering (QA). One method of processing such languages, while bypassing the need for training data, is the use semantic networks. Some low resource languages, such as Kiswahili, are of the subject-verb-object (SVO) structure, and similarly semantic networks are a triple of subject-predicate-object, hence SVO parts of speech tags can map into a semantic network triple. An algorithm to process raw natural language text and map it into a semantic network is therefore necessary and desirable in structuring low resource languages texts. This algorithm tested on the Kiswahili QA task with upto 78.6% exact match. Highlights Languages, both low and high-resource are important for communication. Low resource languages lack vast data repositories necessary for machine learning. Use of language part of speech tags can create meaning from the language. An algorithm can create semantic networks out of the language parts of speech. The semantic network of the language can do practical tasks such as QA.
Foundations of Large Language Models
The development of neural sequence models, such as Transformers [Vaswani et al., 2017], along with the improvements in large-scale self-supervised learning, has opened the door to universal language understanding and generation. This achievement is largely motivated by pre-training: we separate common components from many neural network-based systems, and then train them on huge amounts of unlabeled data using self-supervision. These pre-trained models serve as foundation models that can be easily adapted to different tasks via fine-tuning or prompting. As a result, the paradigm of NLP has been enormously changed. In many cases, large-scale supervised learning for specific tasks is no longer required, and instead, we only need to adapt pre-trained foundation models.
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Recent advance in deep learning has led to rapid adoption of machine learning based NLP models in a wide range of applications. Despite the continuous gain in accuracy, backward compatibility is also an important aspect for industrial applications, yet it received little research attention. Backward compatibility requires that the new model does not regress on cases that were correctly handled by its predecessor. This work studies model update regression in structured prediction tasks. We choose syntactic dependency parsing and conversational semantic parsing as representative examples of structured prediction tasks in NLP.
ESURF: Simple and Effective EDU Segmentation
Sediqin, Mohammadreza, Argamon, Shlomo Engelson
A fundamental task in natural language understanding Despite the successes of contextualized pretrained is analyzing the overall structure of a text, language models (PLMs) like XLNet (Yang so that logical and coherence relations between et al., 2020) in RST discourse parsing, challenges text units are revealed. Rhetorical Structure Theory remain due to data insufficiency, reliance on lexical (RST) (Mann and Thompson, 1988) is a wellacccepted and syntactic clues, and inconsistencies between theoretical framework for the task within EDU-level parsing and sentence-level contextual the NLP community (Kobayashi et al., 2020).
Multimodal Structure-Aware Quantum Data Processing
Hawashin, Hala, Sadrzadeh, Mehrnoosh
While large language models (LLMs) have advanced the field of natural language processing (NLP), their "black box" nature obscures their decision-making processes. To address this, researchers developed structured approaches using higher order tensors. These are able to model linguistic relations, but stall when training on classical computers due to their excessive size. Tensors are natural inhabitants of quantum systems and training on quantum computers provides a solution by translating text to variational quantum circuits. In this paper, we develop MultiQ-NLP: a framework for structure-aware data processing with multimodal text+image data. Here, "structure" refers to syntactic and grammatical relationships in language, as well as the hierarchical organization of visual elements in images. We enrich the translation with new types and type homomorphisms and develop novel architectures to represent structure. When tested on a main stream image classification task (SVO Probes), our best model showed a par performance with the state of the art classical models; moreover the best model was fully structured.
Automating Date Format Detection for Data Visualization
--Data preparation, specifically date parsing, is a significant bottleneck in analytic workflows. T o address this, we present two algorithms--one based on minimum entropy and the other on natural language modeling--that automatically derive date formats from string data. These algorithms achieve over 90% accuracy on a large corpus of data columns, streamlining the data preparation process within visualization environments. The minimal entropy approach is particularly fast, providing interactive feedback. Our methods simplify date format extraction, making them suitable for integration into data visualization tools and databases. Lately, the coordination of information perception advancements like Polaris [1] and Spotfire [2] has featured the significance of joining computational power with human knowledge for successful information examination. While PCs succeed at handling huge datasets, people bring significant space skill and the capacity to perceive designs visually [3], [4]. Frameworks that influence both human criticism and machine handling demonstrate additional success in separating significant experiences from information. Intuitive perception frameworks have become fundamental for empowering clients to investigate information while keeping up with their scientific stream.
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
Özateş, Şaziye Betül, Tıraş, Tarık Emre, Adak, Ece Elif, Doğan, Berat, Karagöz, Fatih Burak, Genç, Efe Eren, Taşdemir, Esma F. Bilgin
This paper introduces foundational resources and models for natural language processing (NLP) of historical Turkish, a domain that has remained underexplored in computational linguistics. We present the first named entity recognition (NER) dataset, HisTR and the first Universal Dependencies treebank, OTA-BOUN for a historical form of the Turkish language along with transformer-based models trained using these datasets for named entity recognition, dependency parsing, and part-of-speech tagging tasks. Additionally, we introduce Ottoman Text Corpus (OTC), a clean corpus of transliterated historical Turkish texts that spans a wide range of historical periods. Our experimental results show significant improvements in the computational analysis of historical Turkish, achieving promising results in tasks that require understanding of historical linguistic structures. They also highlight existing challenges, such as domain adaptation and language variations across time periods. All of the presented resources and models are made available at https://huggingface.co/bucolin to serve as a benchmark for future progress in historical Turkish NLP.
On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2