AITopics | Grammars & Parsing

Collaborating Authors

Grammars & Parsing

News Overviews Instructional Materials AI-Alerts Classics

gzip Predicts Data-dependent Scaling Laws

arXiv.org Artificial IntelligenceMay-26-2024

Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it's trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG, finding that 1) scaling laws are sensitive to differences in data complexity and that 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LM's that accounts for the training data's gzip-compressibility; its compute-optimal frontier increases in dataset size preference (over parameter count preference) as training data becomes harder to compress.

arxiv preprint arxiv, dataset, gzip, (14 more...)

arXiv.org Artificial Intelligence

2405.16684

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Revisiting Meta-evaluation for Grammatical Error Correction

Kobayashi, Masamune, Mita, Masato, Komachi, Mamoru

arXiv.org Artificial IntelligenceMay-26-2024

Metrics are the foundation for automatic evaluation in grammatical error correction (GEC), with their evaluation of the metrics (meta-evaluation) relying on their correlation with human judgments. However, conventional meta-evaluations in English GEC encounter several challenges including biases caused by inconsistencies in evaluation granularity, and an outdated setup using classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities: edit-based and sentence-based, covering 12 state-of-the-art systems including large language models (LLMs), and two human corrections with different focuses. The results of improved correlations by aligning the granularity in the sentence-level meta-evaluation, suggest that edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.

computational linguistic, correlation, evaluation, (15 more...)

arXiv.org Artificial Intelligence

2403.02674

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Maryland > Baltimore (0.04)
North America > Canada > Ontario > Toronto (0.04)
(28 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.73)
Information Technology > Data Science > Data Quality > Data Cleaning (0.63)

Add feedback

Enhancing Augmentative and Alternative Communication with Card Prediction and Colourful Semantics

Pereira, Jayr, Rodrigues, Francisco, Pereira, Jaylton, Zanchettin, Cleber, Fidalgo, Robson

arXiv.org Artificial IntelligenceMay-24-2024

This paper presents an approach to enhancing Augmentative and Alternative Communication (AAC) systems by integrating Colourful Semantics (CS) with transformer-based language models specifically tailored for Brazilian Portuguese. We introduce an adapted BERT model, BERTptCS, which incorporates the CS framework for improved prediction of communication cards. The primary aim is to enhance the accuracy and contextual relevance of communication card predictions, which are essential in AAC systems for individuals with complex communication needs (CCN). We compared BERTptCS with a baseline model, BERTptAAC, which lacks CS integration. Our results demonstrate that BERTptCS significantly outperforms BERTptAAC in various metrics, including top-k accuracy, Mean Reciprocal Rank (MRR), and Entropy@K. Integrating CS into the language model improves prediction accuracy and offers a more intuitive and contextual understanding of user inputs, facilitating more effective communication.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2405.15896

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
South America > Brazil > Pernambuco > Recife (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs

Ersoy, Asım, Yıldız, Olcay Taner

arXiv.org Artificial IntelligenceMay-24-2024

Grammatical Error Correction has seen significant progress with the recent advancements in deep learning. As those methods require huge amounts of data, synthetic datasets are being built to fill this gap. Unfortunately, synthetic datasets are not organic enough in some cases and even require clean data to start with. Furthermore, most of the work that has been done is focused mostly on English. In this work, we introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets from any organic data, and to clean the data used for training Large Language Models. We achieve state-of-the-art results on two Turkish Grammatical Error Correction test sets out of the three publicly available ones. We also show the effectiveness of our method on the training losses of training language models.

correction, dataset, evaluation, (12 more...)

arXiv.org Artificial Intelligence

2405.1532

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Croatia > Dubrovnik-Neretva County > Dubrovnik (0.04)

Genre:

Overview (0.93)
Research Report (0.82)

Industry:

Media > Film (0.49)
Leisure & Entertainment (0.49)

Technology:

Information Technology > Data Science > Data Quality > Data Cleaning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
(2 more...)

Add feedback

ECLIPSE: Semantic Entropy-LCS for Cross-Lingual Industrial Log Parsing

Zhang, Wei, Cheng, Xianfu, Zhang, Yi, Yang, Jian, Guo, Hongcheng, Li, Zhoujun, Yin, Xiaolin, Guan, Xiangyuan, Shi, Xu, Zheng, Liangfan, Zhang, Bo

arXiv.org Artificial IntelligenceMay-24-2024

Log parsing, a vital task for interpreting the vast and complex data produced within software architectures faces significant challenges in the transition from academic benchmarks to the industrial domain. Existing log parsers, while highly effective on standardized public datasets, struggle to maintain performance and efficiency when confronted with the sheer scale and diversity of real-world industrial logs. These challenges are two-fold: 1) massive log templates: The performance and efficiency of most existing parsers will be significantly reduced when logs of growing quantities and different lengths; 2) Complex and changeable semantics: Traditional template-matching algorithms cannot accurately match the log templates of complicated industrial logs because they cannot utilize cross-language logs with similar semantics. To address these issues, we propose ECLIPSE, Enhanced Cross-Lingual Industrial log Parsing with Semantic Entropy-LCS, since cross-language logs can robustly parse industrial logs. On the one hand, it integrates two efficient data-driven template-matching algorithms and Faiss indexing. On the other hand, driven by the powerful semantic understanding ability of the Large Language Model (LLM), the semantics of log keywords were accurately extracted, and the retrieval space was effectively reduced. Notably, we launch a Chinese and English cross-platform industrial log parsing benchmark ECLIPSE- BENCH to evaluate the performance of mainstream parsers in industrial scenarios. Our experimental results across public benchmarks and ECLIPSE- BENCH underscore the superior performance and robustness of our proposed ECLIPSE. Notably, ECLIPSE both delivers state-of-the-art performance when compared to strong baselines and preserves a significant edge in processing efficiency.

eclipse, log template, template, (16 more...)

arXiv.org Artificial Intelligence

2405.13548

Country:

Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.68)

Add feedback

EHR-SeqSQL : A Sequential Text-to-SQL Dataset For Interactively Exploring Electronic Health Records

Ryu, Jaehee, Cho, Seonhee, Lee, Gyubok, Choi, Edward

arXiv.org Artificial IntelligenceMay-23-2024

In this paper, we introduce EHR-SeqSQL, a novel sequential text-to-SQL dataset for Electronic Health Record (EHR) databases. EHR-SeqSQL is designed to address critical yet underexplored aspects in text-to-SQL parsing: interactivity, compositionality, and efficiency. To the best of our knowledge, EHR-SeqSQL is not only the largest but also the first medical text-to-SQL dataset benchmark to include sequential and contextual questions. We provide a data split and the new test set designed to assess compositional generalization ability. Our experiments demonstrate the superiority of a multi-turn approach over a single-turn approach in learning compositionality. Additionally, our dataset integrates specially crafted tokens into SQL queries to improve execution efficiency. With EHR-SeqSQL, we aim to bridge the gap between practical needs and academic research in the text-to-SQL domain.

admission, chartevent, interaction, (15 more...)

arXiv.org Artificial Intelligence

2406.00019

Country: North America > United States > Pennsylvania (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Health Care Technology > Medical Record (1.00)

Technology:

Information Technology > Databases (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.35)

Add feedback

Language processing in humans and computers

Pavlovic, Dusko

arXiv.org Artificial IntelligenceMay-23-2024

Machine-learned language models have transformed everyday life: they steer us when we study, drive, manage money. They have the potential to transform our civilization. But they hallucinate. Their realities are virtual. This note provides a high-level overview of language models and outlines a low-level model of learning machines. It turns out that, after they become capable of recognizing hallucinations and dreaming safely, as humans tend to be, the language-learning machines proceed to generate broader systems of false beliefs and self-confirming theories, as humans tend to do.

chatbot, neural network, vector, (17 more...)

arXiv.org Artificial Intelligence

2405.14233

Country:

North America > United States > Virginia (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Europe > United Kingdom > England > Buckinghamshire > Milton Keynes (0.04)

Genre:

Research Report (1.00)
Instructional Material > Course Syllabus & Notes (0.45)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Issues (1.00)
(3 more...)

Add feedback

jp-evalb: Robust Alignment-based PARSEVAL Measures

Park, Jungyeul, Wang, Junrui, Jo, Eunkyul Leah, Park, Angela Yoonseo

arXiv.org Artificial IntelligenceMay-22-2024

We introduce an evaluation system designed to compute PARSEVAL measures, offering a viable alternative to \texttt{evalb} commonly used for constituency parsing evaluation. The widely used \texttt{evalb} script has traditionally been employed for evaluating the accuracy of constituency parsing results, albeit with the requirement for consistent tokenization and sentence boundaries. In contrast, our approach, named \texttt{jp-evalb}, is founded on an alignment method. This method aligns sentences and words when discrepancies arise. It aims to overcome several known issues associated with \texttt{evalb} by utilizing the `jointly preprocessed (JP)' alignment-based method. We introduce a more flexible and adaptive framework, ultimately contributing to a more accurate assessment of constituency parsing performance.

alignment, computational linguistic, constituent, (16 more...)

arXiv.org Artificial Intelligence

2405.1415

Country:

North America > Canada > British Columbia (0.05)
North America > United States > California > Monterey County > Pacific Grove (0.04)
Asia > China > Hong Kong (0.04)
(18 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Holmes: Benchmark the Linguistic Competence of Language Models

Waldis, Andreas, Perlitz, Yotam, Choshen, Leshem, Hou, Yufang, Gurevych, Iryna

arXiv.org Artificial IntelligenceMay-22-2024

We introduce Holmes, a benchmark to assess the linguistic competence of language models (LMs) - their ability to grasp linguistic phenomena. Unlike prior prompting-based evaluations, Holmes assesses the linguistic competence of LMs via their internal representations using classifier-based probing. In doing so, we disentangle specific phenomena (e.g., part-of-speech of words) from other cognitive abilities, like following textual instructions, and meet recent calls to assess LMs' linguistic competence in isolation. Composing Holmes, we review over 250 probing studies and feature more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of Holmes designed to lower the high computation load while maintaining high-ranking precision.

computational linguistic, linguistic, proceedings, (15 more...)

arXiv.org Artificial Intelligence

2404.18923

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
Europe > Italy > Tuscany > Florence (0.04)
(37 more...)

Genre: Research Report > Experimental Study (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Linguistic Structure from a Bottleneck on Sequential Information Processing

Futrell, Richard, Hahn, Michael

arXiv.org Artificial IntelligenceMay-20-2024

Human language is a unique form of communication in the natural world, distinguished by its structured nature. Most fundamentally, it is systematic, meaning that signals can be broken down into component parts that are individually meaningful -- roughly, words -- which are combined in a regular way to form sentences. Furthermore, the way in which these parts are combined maintains a kind of locality: words are usually concatenated together, and they form contiguous phrases, keeping related parts of sentences close to each other. We address the challenge of understanding how these basic properties of language arise from broader principles of efficient communication under information processing constraints. Here we show that natural-language-like systematicity arises from minimization of excess entropy, a measure of statistical complexity that represents the minimum amount of information necessary for predicting the future of a sequence based on its past. In simulations, we show that codes that minimize excess entropy factorize their source distributions into approximately independent components, and then express those components systematically and locally. Next, in a series of massively cross-linguistic corpus studies, we show that human languages are structured to have low excess entropy at the level of phonology, morphology, syntax, and semantics. Our result suggests that human language performs a sequential generalization of Independent Components Analysis on the statistical distribution over meanings that need to be expressed. It establishes a link between the statistical and algebraic structure of human language, and reinforces the idea that the structure of human language may have evolved to minimize cognitive load while maximizing communicative expressiveness.

entropy, excess entropy, source distribution, (14 more...)

arXiv.org Artificial Intelligence

2405.12109

Country:

North America > United States > California > Orange County > Irvine (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(15 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.46)

Add feedback