For the First Time, AI Analyzes Language as Well as a Human Expert

WIRED

If language is what makes us human, what does it mean now that large language models have gained "metalinguistic" abilities? Among the myriad abilities that humans possess, which ones are uniquely human? Language has been a top candidate at least since Aristotle, who wrote that humanity was "the animal that has language." Even as large language models such as ChatGPT superficially replicate ordinary speech, researchers want to know if there are specific aspects of human language that simply have no parallels in the communication systems of other animals or artificially intelligent devices. In particular, researchers have been exploring the extent to which language models can reason about language itself.


LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval

Lei, Wenchang, Zou, Ping, Wang, Yue, Sun, Feng, Zhao, Lei

arXiv.org Artificial Intelligence

Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations (inheritance, alias, and composition) from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, the model dynamically supplies these relations and related descriptions to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.
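The concept-retrieval loop the abstract describes might be sketched as follows. This is a minimal illustration under assumed names (the graph encoding, function names, and traversal policy are hypothetical; the paper's actual algorithm may differ):

```python
# Hypothetical sketch of a concept iterative retrieval loop: starting from
# terms in the user query, repeatedly pull in concepts linked by the three
# meta-relation types until no new concepts are reachable.
META_RELATIONS = {"inheritance", "alias", "composition"}

def iterative_retrieve(query_terms, graph, max_rounds=5):
    """graph: dict mapping a concept to a list of (relation, concept) edges."""
    retrieved = set(query_terms)
    frontier = set(query_terms)
    for _ in range(max_rounds):
        new = set()
        for concept in frontier:
            for relation, neighbor in graph.get(concept, []):
                if relation in META_RELATIONS and neighbor not in retrieved:
                    new.add(neighbor)
        if not new:          # fixed point: nothing new to retrieve
            break
        retrieved |= new
        frontier = new       # only expand concepts found this round
    return retrieved
```

The retrieved concept set (with its descriptions) would then be supplied to the LLM as grounding context, in place of stuffing the full document into the context window.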


DETECT: Determining Ease and Textual Clarity of German Text Simplifications

Korobeynikova, Maria, Battisti, Alessia, Fischer, Lukas, Gao, Yingqiang

arXiv.org Artificial Intelligence

Current evaluation of German automatic text simplification (ATS) relies on general-purpose metrics such as SARI, BLEU, and BERTScore, which insufficiently capture simplification quality in terms of simplicity, meaning preservation, and fluency. While specialized metrics like LENS have been developed for English, corresponding efforts for German have lagged behind due to the absence of human-annotated corpora. To close this gap, we introduce DETECT, the first German-specific metric that holistically evaluates ATS quality across all three dimensions of simplicity, meaning preservation, and fluency, and is trained entirely on synthetic large language model (LLM) responses. Our approach adapts the LENS framework to German and extends it with (i) a pipeline for generating synthetic quality scores via LLMs, enabling dataset creation without human annotation, and (ii) an LLM-based refinement step for aligning grading criteria with simplification requirements. To the best of our knowledge, we also construct the largest German human evaluation dataset for text simplification to validate our metric directly. Experimental results show that DETECT achieves substantially higher correlations with human judgments than widely used ATS metrics, with particularly strong gains in meaning preservation and fluency. Beyond ATS, our findings highlight both the potential and the limitations of LLMs for automatic evaluation and provide transferable guidelines for general language accessibility tasks.



Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

Saha, Shreya, Li, Shurui, Tuckute, Greta, Li, Yuanning, Zhang, Ru-Yuan, Wehbe, Leila, Fedorenko, Evelina, Khosla, Meenakshi

arXiv.org Artificial Intelligence

The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses--sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting "I had a pancake" to include details like "maple syrup") further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.
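The aggregation strategy described above, averaging embeddings across paraphrases (or generated images) and then fitting a linear encoding model to neural responses, can be sketched in a few lines. This is a simplified illustration, not the authors' pipeline; real encoding models typically use regularized regression and cross-validation:

```python
import numpy as np

def aggregate_embeddings(embeddings):
    """Mean of embedding vectors from multiple paraphrases or generated
    images of one sentence: a form-independent estimate of its meaning."""
    stacked = np.stack(embeddings)   # shape: (n_variants, dim)
    return stacked.mean(axis=0)      # shape: (dim,)

def fit_encoding_model(X, Y):
    """Ordinary least-squares encoding model: predict neural responses Y
    (n_sentences x n_voxels) from sentence features X (n_sentences x dim)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W                         # shape: (dim, n_voxels)
```

Averaging over variants washes out form-specific components of the embedding while preserving what the variants share, which is one way to operationalize "abstract meaning" as a regression target.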


A Beam Search Algorithm

Neural Information Processing Systems

Algorithm 1 demonstrates the step-by-step operations of our beam search algorithm (see Sec. 4.3). We consider recovering sentences in the current work and leave recovering longer paragraphs as future work. We keep 2000 examples of each dataset as the evaluation set and use the rest for training. In the results, "End-to-End optimization"; "Reg" means the inclusion of a regularization term; "DR" refers to a discrete token. Our approach is unique as it does not rely on end-to-end optimization and is demonstrated on large batch sizes (i.e.
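The general beam-search pattern referenced above can be sketched generically. This is a minimal illustration of the technique, not the paper's Algorithm 1 (the scoring function and vocabulary here are placeholders):

```python
def beam_search(vocab, score_fn, beam_width=2, max_len=3):
    """Generic beam search: at each step, extend every kept partial
    sequence by every token and retain only the beam_width candidates
    with the highest cumulative score under score_fn."""
    beams = [((), 0.0)]  # list of (sequence, score)
    for _ in range(max_len):
        candidates = []
        for seq, _ in beams:
            for tok in vocab:
                new_seq = seq + (tok,)
                candidates.append((new_seq, score_fn(new_seq)))
        # keep only the top-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams
```

With beam_width=1 this degenerates to greedy decoding; widening the beam trades compute for a better chance of recovering the globally highest-scoring sequence.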


What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

Wyatt, Charlie, Joshi, Aditya, Salim, Flora

arXiv.org Artificial Intelligence

Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP's focus on single-token prediction often limits a model's ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries--an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP)--the task of infilling a randomly removed sentence--from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.
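The MSP setup above is easy to reproduce in miniature: remove one sentence, prompt a model to infill it, and score the prediction against the original. The sketch below uses token-overlap F1 as a stand-in fidelity score (the paper's actual prompts and metrics are not specified here, so names and scoring are illustrative):

```python
def mask_sentence(sentences, index):
    """Remove one sentence from a document and build an infilling prompt."""
    context = sentences[:index] + ["[MASK]"] + sentences[index + 1:]
    target = sentences[index]
    prompt = "Fill in the missing sentence marked [MASK]:\n" + " ".join(context)
    return prompt, target

def fidelity(prediction, target):
    """Toy fidelity: token-overlap F1 between prediction and the original
    sentence (real evaluations would use embedding or learned metrics)."""
    pred, ref = set(prediction.lower().split()), set(target.lower().split())
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    p, r = overlap / len(pred), overlap / len(ref)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

Cohesiveness, by contrast, would score the infilled sentence against the surrounding context rather than the removed original, so the two metrics can disagree.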


What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Ferrara, Alfio, Picascia, Sergio, Pinnavaia, Laura, Ranitovic, Vojimir, Rocchetti, Elisabetta, Tuveri, Alice

arXiv.org Artificial Intelligence

Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.
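The "sensitivity shift" the abstract measures can be operationalized as a transition count between sensitivity classes before and after paraphrasing. A minimal sketch, assuming each sentence has been labeled with a class such as "taboo", "derogatory", or "neutral" (the class inventory and any shift statistic are assumptions, not the paper's exact protocol):

```python
from collections import Counter

def sensitivity_shifts(pairs):
    """Count (class_before, class_after) transitions across a corpus of
    (original_label, paraphrase_label) pairs."""
    return Counter((before, after) for before, after in pairs)

def moderation_rate(pairs, sensitive_classes):
    """Fraction of originally sensitive sentences whose paraphrase left
    the sensitive classes, i.e. was implicitly moderated."""
    sensitive = [(b, a) for b, a in pairs if b in sensitive_classes]
    if not sensitive:
        return 0.0
    moderated = sum(1 for _, a in sensitive if a not in sensitive_classes)
    return moderated / len(sensitive)
```

Systematic moderation would show up as probability mass concentrating in transitions from sensitive classes toward "neutral", rather than a symmetric scatter.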