AITopics

2510.04945

Country:

Europe (1.00)
North America > Mexico (0.94)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.85)

Vadehra, Ankit, Johnson, Bill, Saunders, Gene, Poupart, Pascal

Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation

arXiv.org Artificial IntelligenceOct-7-2025

Text editing can involve several iterations of revision. Incorporating an efficient Grammar Error Correction (GEC) tool in the initial correction round can significantly impact further human editing effort and final text quality. This raises an interesting question to quantify GEC Tool usability: How much effort can the GEC Tool save users? We present the first large-scale dataset of post-editing (PE) time annotations and corrections for two English GEC test datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET) for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by estimating PE time-to-correct. Using our dataset, we quantify the amount of time saved by GEC Tools in text editing. Analyzing the edit type indicated that determining whether a sentence needs correction and edits like paraphrasing and punctuation changes had the greatest impact on PE time. Finally, comparison with human rankings shows that PEET correlates well with technical effort judgment, providing a new human-centric direction for evaluating GEC tool usability. We release our dataset and code at: https://github.com/ankitvad/PEET_Scorer.

large language model, machine learning, natural language, (17 more...)

2510.04394

Country:

Europe (1.00)
North America > United States > California (0.28)

Genre:

Research Report (0.82)
Overview (0.68)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)
(3 more...)

arXiv.org Artificial IntelligenceOct-7-2025

LongTail-Swap: benchmarking language models' abilities on rare words

Algayres, Robin, Saint-James, Charles-Éric, Luthra, Mahi, Shen, Jiayi, Lin, Dongyan, Benchekroun, Youssef, Moritz, Rashel, Pino, Juan, Dupoux, Emmanuel

Children learn to speak with a low amount of data and can be taught new words on a few-shot basis, making them particularly data-efficient learners. The BabyLM challenge aims at exploring language model (LM) training in the low-data regime but uses metrics that concentrate on the head of the word distribution. Here, we introduce LongTail-Swap (LT-Swap), a benchmark that focuses on the tail of the distribution, i.e., measures the ability of LMs to learn new words with very little exposure, like infants do. LT-Swap is a pretraining corpus-specific test set of acceptable versus unacceptable sentence pairs that isolate semantic and syntactic usage of rare words. Models are evaluated in a zero-shot fashion by computing the average log probabilities over the two members of each pair. We built two such test sets associated with the 10M words and 100M words BabyLM training sets, respectively, and evaluated 16 models from the BabyLM leaderboard. Our results not only highlight the poor performance of language models on rare words but also reveal that performance differences across LM architectures are much more pronounced in the long tail than in the head. This offers new insights into which architectures are better at handling rare word generalization. We've also made the code publicly avail

frequency bin, large language model, machine learning, (18 more...)

2510.04268

Country: Europe (0.67)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Zhang, Yilin, Zhao, Xinran, Wang, Zora Zhiruo, Yang, Chenyang, Wei, Jiayi, Wu, Tongshuang

Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve actuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the process of dividing documents into retrievable units. Existing line-based chunking heuristics often break semantic structures, splitting functions or merging unrelated code, which can degrade generation quality. We propose chunking via Abstract Syntax Trees (\ourwork), a structure-aware method that recursively breaks large AST nodes into smaller chunks and merges sibling nodes while respecting size limits. This approach generates self-contained, semantically coherent units across programming languages and tasks, improving performance on diverse code generation tasks, e.g., boosting Recall@5 by 4.3 points on RepoEval retrieval and Pass@1 by 2.67 points on SWE-bench generation. Our work highlights the importance of structure-aware chunking for scaling retrieval-enhanced code intelligence.

large language model, machine learning, natural language, (18 more...)

2506.15655

Country:

Asia (0.68)
North America (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

Gjurković, Matej

Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two main challenges remain central to this thesis: the scarcity of large, personality-labeled datasets and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets -- MBTI9k and PANDORA -- collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the SIMPA (Statement-to-Item Matching Personality Assessment) framework was developed - a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA's versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for various research and practical applications involving complex label taxonomies and variable cue associations with target concepts.

large language model, machine learning, myer-briggs type indicator, (24 more...)

2510.02811

Country:

Asia (1.00)
North America > United States > California (0.45)
Europe > United Kingdom > England (0.27)
North America > United States > Minnesota (0.27)

Genre:

Workflow (1.00)
Research Report > New Finding (1.00)
Overview (1.00)
(3 more...)

Industry:

Health & Medicine (1.00)
Education (1.00)
Information Technology > Security & Privacy (0.92)
(3 more...)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(5 more...)

Schulz, Laura Ying, Mitropolsky, Daniel, Poggio, Tomaso

Unraveling Syntax: How Language Models Learn Context-Free Grammars

We introduce a new framework for understanding how language models acquire syntax. While large models achieve impressive results, little is known about their learning dynamics. Our approach starts with the observation that most domains of interest, such as natural language syntax, coding languages, arithmetic problems, are captured by probabilistic context-free grammars (PCFGs). We study the learning dynamics of small models trained on synthetic languages generated from PCFGs, enabling precise control over grammar complexity, recursion depth, and subgrammar structure. We prove several general, recursive formulae for the training loss and Kullback-Leibler divergence over the subgrammar structure of a PCFG. Empirically, we find that unlike children, who first master simple substructures before progressing to more complex constructions, transformers reduce loss across all subgrammars in parallel. We further show that subgrammar pretraining can improve the final loss for smaller models, and that pretrained models develop internal representations more aligned with the grammar's substructure. Finally, we demonstrate that models struggle with deeper recursive structures (a limitation even of large language models), revealing fundamental challenges in how neural networks represent hierarchical syntax. Overall, our work initiates the study of the learning dynamics of transformers on PCFGs as a versatile testbed for probing learning in language models, opening a research direction with many open questions.

machine learning, natural language, subgrammar, (15 more...)

2510.02524

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Patwardhan, Manasi, Agarwal, Ayush, Bhaisaheb, Shabbirhussain, Arora, Aseem, Vig, Lovekesh, Sarawagi, Sunita

Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

Abstract--The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions, their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc query specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string level match. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB level structured domain statements are more practical and accurate than existing ad-hoc query specific textual domain statements, and (2) Our sub-string match based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches. The impressive natural language understanding and code generation capabilities of modern LLMs has led to significantly improved performance on NL-SQL semantic parsing [1], [2]. However, their accuracy varies widely with the database queried [3]. DBs in WikiSQL [4] or Spider [5] contain semantically meaningful table/column names and cell values making it easier for LLMs to accurately link domain expressions in the NL query with the DB schema/cell elements.

large language model, machine learning, natural language, (20 more...)

2510.02394

Country:

Asia > India (0.46)
North America > United States > California (0.14)

Genre: Research Report (0.50)

Industry:

Leisure & Entertainment (0.48)
Education (0.46)
Health & Medicine > Therapeutic Area (0.30)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Neural Information Processing SystemsOct-3-2025, 04:02:34 GMT

Export Reviews, Discussions, Author Feedback and Meta-Reviews

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Mallows models are a classically studied class of distributions over permutations that can be viewed as a sequential model in which items are inserted one by one into a ranking. This paper proposes an interesting hierarchical generalization of Mallows models in which groups of items are sequentially ``merged'' together (as they would be in mergesort). The model can also be viewed as a special case of a recently proposed class of ``riffle independent'' models by Huang/Guestrin, but with a more tractable number of parameters in general and better computational properties. There are several nice contributions in this paper, including a simple and elegant characterization of identifiability of the structure, as well as an interesting structure estimation algorithm based on the inside-outside parsing algorithm for stochastic context free grammars.

algorithm, mallow model, permutation, (10 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Lebanon (0.06)
North America > Canada > Quebec > Montreal (0.05)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (0.89)