Grammars & Parsing
HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics
Semantic parsing of long documents remains challenging due to quadratic growth in pairwise composition and memory requirements. We introduce \textbf{Hierarchical Segment-Graph Memory (HSGM)}, a novel framework that decomposes an input of length $N$ into $M$ meaningful segments, constructs \emph{Local Semantic Graphs} on each segment, and extracts compact \emph{summary nodes} to form a \emph{Global Graph Memory}. HSGM supports \emph{incremental updates} -- only newly arrived segments incur local graph construction and summary-node integration -- while \emph{Hierarchical Query Processing} locates relevant segments via top-$K$ retrieval over summary nodes and then performs fine-grained reasoning within their local graphs. Theoretically, HSGM reduces worst-case complexity from $O(N^2)$ to $O\!\left(N\,k + (N/k)^2\right)$, with segment size $k \ll N$, and we derive Frobenius-norm bounds on the approximation error introduced by node summarization and sparsification thresholds. Empirically, on three benchmarks -- long-document AMR parsing, segment-level semantic role labeling (OntoNotes), and legal event extraction -- HSGM achieves \emph{2--4$\times$ inference speedup}, \emph{$>60\%$ reduction} in peak memory, and \emph{$\ge 95\%$} of baseline accuracy. Our approach unlocks scalable, accurate semantic modeling for ultra-long texts, enabling real-time and resource-constrained NLP applications.
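The two-stage query described in the abstract (top-$K$ retrieval over summary nodes, then fine-grained search inside the selected segments) can be illustrated with a minimal sketch. The abstract does not say how summary nodes are computed or scored, so mean pooling and cosine similarity are assumptions made here for illustration, and the names `build_memory` and `hierarchical_query` are hypothetical rather than taken from HSGM.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Average a non-empty list of equal-length vectors (assumed summary rule)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_memory(token_vectors: Sequence[Vector], k: int):
    """Split N vectors into segments of size k and keep one summary per segment."""
    segments = [token_vectors[i:i + k] for i in range(0, len(token_vectors), k)]
    summaries = [mean_vector(seg) for seg in segments]
    return segments, summaries


def hierarchical_query(query: Vector, segments, summaries, top_k: int
                       ) -> Tuple[List[int], Tuple[float, int, int]]:
    """Stage 1: rank segments by summary similarity and keep the top_k.
    Stage 2: search individual nodes only inside the kept segments."""
    ranked = sorted(range(len(summaries)),
                    key=lambda i: cosine(query, summaries[i]),
                    reverse=True)
    chosen = ranked[:top_k]
    best = max((cosine(query, vec), s, j)
               for s in chosen
               for j, vec in enumerate(segments[s]))
    return chosen, best  # best = (score, segment index, node index)


# Toy data: six 2-d "node embeddings" grouped into three segments of size 2.
vectors = [[1, 0], [1, 0.1], [0, 1], [0.1, 1], [-1, 0], [-1, -0.1]]
segments, summaries = build_memory(vectors, k=2)
chosen, best = hierarchical_query([0, 1], segments, summaries, top_k=1)
```

In this sketch stage 1 compares the query against $N/k$ summaries and stage 2 touches at most `top_k * k` nodes, which is what keeps the query from scanning all $N$ nodes.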
SongPrep: A Preprocessing Framework and End-to-end Model for Full-song Structure Parsing and Lyrics Transcription
Tan, Wei, Lei, Shun, Zhang, Huaicheng, Li, Guangzheng, Zhang, Yixuan, Chen, Hangting, Yu, Jianwei, Gu, Rongzhi, Yu, Dong
Artificial Intelligence Generated Content (AIGC) is currently a popular research area. Among its various branches, song generation has attracted growing interest. Despite the abundance of available songs, effective data preparation remains a significant challenge. Converting these songs into training-ready datasets typically requires extensive manual labeling, which is both time consuming and costly. To address this issue, we propose SongPrep, an automated preprocessing pipeline designed specifically for song data. This framework streamlines key processes such as source separation, structure analysis, and lyric recognition, producing structured data that can be directly used to train song generation models. Furthermore, we introduce SongPrepE2E, an end-to-end structured lyrics recognition model based on pretrained language models. Without the need for additional source separation, SongPrepE2E is able to analyze the structure and lyrics of entire songs and provide precise timestamps. By leveraging context from the whole song alongside pretrained semantic knowledge, SongPrepE2E achieves low Diarization Error Rate (DER) and Word Error Rate (WER) on the proposed SSLD-200 dataset. Downstream tasks demonstrate that training song generation models with the data output by SongPrepE2E enables the generated songs to closely resemble those produced by humans.
Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
Lacombe, Valentin, Quesnel, Valentin, Sileo, Damien
We introduce Reasoning Core, a new scalable environment for Reinforcement Learning with Verifiable Rewards (RLVR), designed to advance foundational symbolic reasoning in Large Language Models (LLMs). Unlike existing benchmarks that focus on games or isolated puzzles, Reasoning Core procedurally generates problems across core formal domains, including PDDL planning, first-order logic, context-free grammar parsing, causal reasoning, and system equation solving. The environment is built on key design principles of high-generality problem distributions, verification via external tools, and continuous difficulty control, which together provide a virtually infinite supply of novel training instances. Initial zero-shot evaluations with frontier LLMs confirm the difficulty of Reasoning Core's tasks, positioning it as a promising resource to improve the reasoning capabilities of future models.
Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
Li, Haoyuan, Liu, Rui, Fan, Hehe, Yang, Yi
Enabling agents to understand and interact with complex 3D scenes is a fundamental challenge for embodied artificial intelligence systems. While Multimodal Large Language Models (MLLMs) have achieved significant progress in 2D image understanding, extending such capabilities to 3D scenes remains difficult: 1) 3D environments involve richer concepts such as spatial relationships, affordances, physics, and layout; and 2) the absence of large-scale 3D vision-language datasets has posed a significant obstacle. In this paper, we introduce Text-Scene, a framework that automatically parses 3D scenes into textual descriptions for scene understanding. Given a 3D scene, our model identifies object attributes and spatial relationships, and then generates a coherent summary of the whole scene, bridging the gap between 3D observation and language without requiring human-in-the-loop intervention. By leveraging both geometric analysis and MLLMs, Text-Scene produces descriptions that are accurate, detailed, and human-interpretable, capturing object-level details and global-level context. Experimental results on benchmarks demonstrate that our textual parses can faithfully represent 3D scenes and benefit downstream tasks. To evaluate the reasoning capability of MLLMs, we present InPlan3D, a comprehensive benchmark for 3D task planning, consisting of 3174 long-term planning tasks across 636 indoor scenes. We emphasize clarity and accessibility in our approach, aiming to make 3D scene content understandable through language. Code and datasets will be released.
Refining Syntactic Distinctions Using Decision Trees: A Paper on Postnominal 'That' in Complement vs. Relative Clauses
In this study, we first tested the performance of the TreeTagger English model developed by Helmut Schmid with test files at our disposal, using this model to analyze relative clauses and noun complement clauses in English. We distinguished between the two uses of "that," both as a relative pronoun and as a complementizer. To achieve this, we employed an algorithm to reannotate a corpus that had originally been parsed using the Universal Dependencies framework with the EWT Treebank. In the next phase, we proposed an improved model by retraining TreeTagger and compared the newly trained model with Schmid's baseline model. This process allowed us to fine-tune the model's performance to more accurately capture the subtle distinctions in the use of "that" as a complementizer and as a nominal. We also examined the impact of varying the training dataset size on TreeTagger's accuracy and assessed the representativeness of the EWT Treebank files for the structures under investigation. Additionally, we analyzed some of the linguistic and structural factors influencing the ability to effectively learn this distinction.
CL$^2$GEC: A Multi-Discipline Benchmark for Continual Learning in Chinese Literature Grammatical Error Correction
Qin, Shang, Ye, Jingheng, Li, Yinghui, Zheng, Hai-Tao, Li, Qi, Shan, Jinxiao, Li, Zhixing, Kim, Hong-Gee
The growing demand for automated writing assistance in diverse academic domains highlights the need for robust Chinese Grammatical Error Correction (CGEC) systems that can adapt across disciplines. However, existing CGEC research largely lacks dedicated benchmarks for multi-disciplinary academic writing, overlooking continual learning (CL) as a promising solution to handle domain-specific linguistic variation and prevent catastrophic forgetting. To fill this crucial gap, we introduce CL$^2$GEC, the first Continual Learning benchmark for Chinese Literature Grammatical Error Correction, designed to evaluate adaptive CGEC across multiple academic fields. Our benchmark includes 10,000 human-annotated sentences spanning 10 disciplines, each exhibiting distinct linguistic styles and error patterns. CL$^2$GEC focuses on evaluating grammatical error correction in a continual learning setting, simulating sequential exposure to diverse academic disciplines to reflect real-world editorial dynamics. We evaluate large language models under sequential tuning, parameter-efficient adaptation, and four representative CL algorithms, using both standard GEC metrics and continual learning metrics adapted to task-level variation. Experimental results reveal that regularization-based methods mitigate forgetting more effectively than replay-based or naive sequential approaches. Our benchmark provides a rigorous foundation for future research in adaptive grammatical error correction across diverse academic domains.
UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
Imperial, Joseph Marvin, Barayan, Abdullah, Stodden, Regina, Wilkens, Rodrigo, Sanchez, Ricardo Munoz, Gao, Lingyun, Torgbi, Melissa, Knight, Dawn, Forey, Gail, Jablonkai, Reka R., Kochmar, Ekaterina, Reynolds, Robert, Ribeiro, Eugénio, Saggion, Horacio, Volodina, Elena, Vajjala, Sowmya, François, Thomas, Alva-Manchego, Fernando, Madabushi, Harish Tayyar
We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats, and promoting their accessibility to the global research community.
Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect
Klerings, Alina, Brinkmann, Jannik, Ruffinelli, Daniel, Ponzetto, Simone
Large language models (LLMs) are able to generate grammatically well-formed text, but how do they encode their syntactic knowledge internally? While prior work has focused largely on binary grammatical contrasts, in this work, we study the representation and control of two multidimensional hierarchical grammar phenomena - verb tense and aspect - and for each, identify distinct, orthogonal directions in residual space using linear discriminant analysis. Next, we demonstrate causal control over both grammatical features through concept steering across three generation tasks. Then, we use these identified features in a case study to investigate factors influencing effective steering in multi-token generation. We find that steering strength, location, and duration are crucial parameters for reducing undesirable side effects such as topic shift and degeneration. Our findings suggest that models encode tense and aspect in structurally organized, human-like ways, but effective control of such features during generation is sensitive to multiple factors and requires manual tuning or automated optimization.
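For readers unfamiliar with the two ingredients named in the abstract, the following is a generic textbook formulation, not the authors' exact procedure. It assumes a two-class simplification (for example, past vs. present tense), residual-stream activations $h_i$ with labels $y_i$, class means $\mu_0, \mu_1$, and a scalar steering strength $\alpha$; all of these symbols are introduced here for illustration.

```latex
% Two-class Fisher LDA: within-class scatter and discriminant direction
\[
  S_W = \sum_{c \in \{0,1\}} \; \sum_{i \,:\, y_i = c} (h_i - \mu_c)(h_i - \mu_c)^\top,
  \qquad
  w \propto S_W^{-1}(\mu_1 - \mu_0)
\]
% Additive steering of an activation h along the unit-normalized direction
\[
  h' = h + \alpha \, \frac{w}{\lVert w \rVert_2}
\]
```

Under this reading, the steering strength discussed in the abstract corresponds to $\alpha$, while location and duration correspond to which layer and which generated token positions receive the update.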
Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
Wang, Shuaiqi, Raunak, Vikas, Backurs, Arturs, Reis, Victor, Zhou, Pei, Chen, Sihao, Yang, Longqi, Lin, Zinan, Yekhanin, Sergey, Fanti, Giulia
Differentially private (DP) synthetic data generation is a promising technique for utilizing private datasets that otherwise cannot be exposed for model training or other analytics. While much research literature has focused on generating private unstructured text and image data, in enterprise settings, structured data (e.g., tabular) is more common, often including natural language fields or components. Existing synthetic data evaluation techniques (e.g., FID) struggle to capture the structural properties and correlations of such datasets. In this work, we propose Struct-Bench, a framework and benchmark for evaluating synthetic datasets derived from structured datasets that contain natural language data. The Struct-Bench framework requires users to provide a representation of their dataset structure as a Context-Free Grammar (CFG). Our benchmark comprises 5 real-world and 2 synthetically generated datasets, each annotated with CFGs. We show that these datasets demonstrably present a great challenge even for state-of-the-art DP synthetic data generation methods. Struct-Bench also includes reference implementations of different metrics and a leaderboard, thereby providing researchers a standardized evaluation platform to benchmark and investigate privacy-preserving synthetic data generation methods. Further, we also present a case study showing how to use Struct-Bench to improve the synthetic data quality of Private Evolution (PE) on structured data. The benchmark and the leaderboard have been publicly made available at https://struct-bench.github.io.
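To make concrete what it means to describe a dataset's structure as a Context-Free Grammar, here is a minimal sketch that checks whether a tokenized record conforms to a toy grammar using the standard CYK algorithm. The grammar, the token names (`id`, `user`, `bot`), and the function `cyk_accepts` are hypothetical illustrations; they are not Struct-Bench's grammar format or API.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical grammar in Chomsky normal form for a chat-style record:
# a header followed by one or more (user, bot) turns.
TERMINAL_RULES: List[Tuple[str, str]] = [
    ("HEADER", "id"),
    ("USER", "user"),
    ("BOT", "bot"),
]
BINARY_RULES: List[Tuple[str, Tuple[str, str]]] = [
    ("RECORD", ("HEADER", "BODY")),
    ("BODY", ("TURN", "BODY")),
    ("BODY", ("USER", "BOT")),
    ("TURN", ("USER", "BOT")),
]


def cyk_accepts(tokens: Sequence[str], start: str = "RECORD") -> bool:
    """Return True if the token sequence is derivable from `start` (CYK)."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][l - 1] holds the nonterminals that derive tokens[i:i + l].
    table: List[List[Set[str]]] = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = {lhs for lhs, term in TERMINAL_RULES if term == tok}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, (b, c) in BINARY_RULES:
                    if b in left and c in right:
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


valid = cyk_accepts(["id", "user", "bot", "user", "bot"])
truncated = cyk_accepts(["id", "user"])
headerless = cyk_accepts(["user", "bot"])
```

A check of this kind captures only structural validity; measuring the correlations between fields mentioned in the abstract would need additional metrics.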
Linguistic trajectories of bipolar disorder on social media
Plank, Laurin, Zlomuzica, Armin
Correspondence should be addressed to: Laurin Plank. This paper has not yet been peer-reviewed. Abstract: Language provides valuable markers of affective disorders such as bipolar disorder (BD), yet clinical assessments remain limited in scale. In response, analyses of social media (SM) language have gained prominence due to their high temporal resolution and longitudinal scope. Here, we introduce a method to determine the timing of users' diagnoses and apply it to study language trajectories from 3 years before to 21 years after BD diagnosis, contrasted with users reporting unipolar depression (UD) and non-affected users (HC). We show that BD diagnosis is accompanied by pervasive linguistic alterations reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, unusual thought content, and disorganized thought. We further observe recurring mood-related language changes across two decades after the diagnosis, with a pronounced 12-month periodicity suggestive of seasonal mood episodes. Finally, trend-level evidence suggests an increased periodicity in users estimated to be female. In sum, our findings provide evidence for language alterations in the acute and chronic phase of BD. This validates and extends recent efforts leveraging SM for scalable monitoring of mental health. Knowledge of diagnosis events allows language alterations to be contextualized with respect to the current disorder phase. For example, it would allow comparing language change from a premorbid to the acute disorder phase, or studying long-term behavioral patterns in the chronic disorder phase. We then use the resulting digital clinical cohorts (DICCs) to study longitudinal language trajectories in users who self-disclose having been diagnosed with BD. This time information is then passed to SUTime, a temporal parsing algorithm, which yields normalized datetime information.
These data are additionally filtered through a rule-based algorithm to exclude non-viable datetimes (e.g., those including seasonal information such as "spring, 2022"). Pseudo-diagnoses are assigned to a group of regular Reddit users who serve as a healthy control group (HC). Fig. 1 gives an overview of the DICCs pipeline.
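The rule-based filtering step can be sketched as follows. The text does not specify the rules, so this example assumes that normalized values arrive as strings and that only plain calendar values (`YYYY`, `YYYY-MM`, or `YYYY-MM-DD`) count as viable; both the granularity choice and the name `keep_viable` are assumptions made for illustration.

```python
import re
from typing import Iterable, List

# Accept a year, a year-month, or a full date; anything else (for example a
# seasonal value or free text) is treated as non-viable under this assumption.
ISO_DATE = re.compile(
    r"^\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?$"
)


def keep_viable(values: Iterable[str]) -> List[str]:
    """Keep only strings that look like plain calendar dates."""
    return [v for v in values if ISO_DATE.match(v)]


candidates = ["2022-03-14", "2022-SP", "2021-11", "2020", "spring, 2022", "2022-13"]
viable = keep_viable(candidates)
```

The pattern checks format only, so an impossible date such as `2022-02-31` would still pass; a stricter variant could additionally parse each match with `datetime.date`.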