AITopics

2506.01151

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Srivastava, Prerak, Corallo, Giulio, Rybalko, Sergey

A Word is Worth 4-bit: Efficient Log Parsing with Binary Coded Decimal Recognition

arXiv.org Artificial IntelligenceJun-3-2025

System-generated logs are typically converted into categorical log templates through parsing. These templates are crucial for generating actionable insights in various downstream tasks. However, existing parsers often fail to capture fine-grained template details, leading to suboptimal accuracy and reduced utility in downstream tasks requiring precise pattern identification. We propose a character-level log parser utilizing a novel neural architecture that aggregates character embeddings. Our approach estimates a sequence of binary-coded decimals to achieve highly granular log templates extraction. Our low-resource character-level parser, tested on revised Loghub-2k and a manually annotated industrial dataset, matches LLM-based parsers in accuracy while outperforming semantic parsers in efficiency.

machine learning, natural language, template, (18 more...)

2506.01147

Country: Europe (0.68)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Corro, Caio, Lacroix, Mathieu, Roux, Joseph Le

Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference Algorithms

arXiv.org Artificial IntelligenceJun-3-2025

We propose a novel discriminative model for sequence labeling called Bregman conditional random fields (BCRF). Contrary to standard linear-chain conditional random fields, BCRF allows fast parallelizable inference algorithms based on iterative Bregman projections. We show how such models can be learned using Fenchel-Young losses, including extension for learning from partial labels. Experimentally, our approach delivers comparable results to CRF while being faster, and achieves better results in highly constrained settings compared to mean field, another parallelizable alternative.

artificial intelligence, machine learning, natural language, (16 more...)

2506.00732

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)

Neural Information Processing SystemsJun-1-2025, 22:12:51 GMT

Review for NeurIPS paper: Strongly Incremental Constituency Parsing with Graph Neural Networks

Weaknesses: The following are my concerns (questions) and confirmations of the proposed method. I understand that the number of actions required to parse a sentence for the proposed method is n, where the number of tokens in the sentence is n . However, the computational cost for one action seems relatively very expensive comparing with the existing transition-based algorithm, such as standard shift-reduce parser. Therefore, I suspect that the actual runtime of parsing a single sentence takes much larger than the conventional methods. Regardless of my suspicion is correct or not, the experimental results in the current version does not answer this point.

algorithm, incremental constituency parsing, rebuttal period, (10 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

arXiv.org Artificial IntelligenceMay-30-2025

Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Dobrovoljc, Kaja

This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.

artificial intelligence, corpora, natural language, (15 more...)

2505.22774

Country:

Europe > Netherlands (0.28)
North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

Issam, Abderrahmane, Semerci, Yusuf Can, Scholtes, Jan, Spanakis, Gerasimos

A Representation Level Analysis of NMT Model Robustness to Grammatical Errors

Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term Robustness Heads. We find that Robustness Heads attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on Robustness Heads for updating the ungrammatical word representation.

computational linguistic, machine learning, natural language, (18 more...)

2505.21224

Country:

Europe (1.00)
Asia (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction

Qian, Mengjie, Ma, Rao, Bannò, Stefano, Knill, Kate M., Gales, Mark J. F.

Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.

data quality, machine learning, natural language, (14 more...)

2505.21137

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.29)
North America > Mexico (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.75)
Information Technology > Data Science > Data Quality > Data Cleaning (0.65)

Aquino, Angelina A., Miranda, Lester James V., Or, Elsie Marie T.

The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project

This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.

computational linguistic, large language model, machine learning, (19 more...)

2505.20428

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States > Minnesota (0.28)
Asia > Philippines > Luzon > National Capital Region > City of Manila (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.93)
Leisure & Entertainment > Sports > Basketball (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Ragsdale, Jarrod, Boppana, Rajendra

ShIOEnv: A CLI Behavior-Capturing Environment Enabling Grammar-Guided Command Synthesis for Dataset Curation

Command-line interfaces (CLIs) provide structured textual environments for system administration. Explorations have been performed using pre-trained language models (PLMs) to simulate these environments for safe interaction in high-risk environments. However, their use has been constrained to frozen, large parameter models like GPT. For smaller architectures to reach a similar level of believability, a rich dataset of CLI interactions is required. Existing public datasets focus on mapping natural-language tasks to commands, omitting crucial execution data such as exit codes, outputs, and environmental side effects, limiting their usability for behavioral modeling. We introduce a Shell Input -Output Environment (ShIOEnv), which casts command construction as a Markov Decision Process whose state is the partially built sequence and whose actions append arguments. After each action, ShIOEnv executes the candidate and returns its exit status, output, and progress toward a minimal-length behavioral objective. Due to the intractable nature of the combinatorial argument state-action space, we derive a context-free grammar from man pages to mask invalid arguments from being emitted. We explore random and proximal-policy optimization (PPO)-optimized sampling of unrestricted and grammar-masked action spaces to produce four exploration strategies. We observed that grammar masking and PPO significantly improve sample efficiency to produce a higher quality dataset (maximizing the number of arguments while minimizing redundancies). Policy-generated datasets of shell input-output behavior pairs are used to fine-tune CodeT5, where we observe 85% improvements in BLEU-4 when constraining the action space to grammar productions with an additional 26% improvement when applying PPO. The ShIOEnv environment and curated command behavior datasets are released for use in future research.

machine learning, natural language, reinforcement learning, (14 more...)

2505.18374

Country: North America > United States > Texas (0.14)

Genre: Research Report (0.64)

Industry:

Information Technology > Security & Privacy (0.46)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.48)

Neural Information Processing SystemsMay-27-2025, 18:04:35 GMT

Tracing Hyperparameter Dependencies for Model Parsing via Learnable Graph Pooling Network

Since a diverse set of hyperparameters is jointly employed by the generative model, and dependencies often exist among them, it is crucial to learn these hyperparameter dependencies for improving the model parsing performance. To explore such important dependencies, we propose a novel model parsing method called Learnable Graph Pooling Network (LGPN), in which we formulate model parsing as a graph node classification problem, using graph nodes and edges to represent hyperparameters and their dependencies, respectively. Furthermore, LGPN incorporates a learnable pooling-unpooling mechanism tailored to model parsing, which adaptively learns hyperparameter dependencies of GMs used to generate the input image. Also, we introduce a Generation Trace Capturing Network (GTC) that can efficiently identify generation traces of input images, enhancing the understanding of generated images' provenances.Empirically, we achieve state-of-the-art performance in model parsing and its extended applications, showing the superiority of the proposed LGPN.

artificial intelligence, learnable graph pooling network, natural language, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)