Goto

Collaborating Authors

 lexicon


LEXICON: a Benchmark for Planning under Temporal Constraints in Natural Language

Neural Information Processing Systems

Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LEXICON--a natural language-based (LEXI) constrained (CON) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LEXICON is to take existing planning environments and impose temporal constraints on the states.


Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement

Neural Information Processing Systems

Understanding how the human brain progresses from processing simple linguistic inputs to performing high-level reasoning is a fundamental challenge in neuroscience. While modern large language models (LLMs) are increasingly used to model neural responses to language, their internal representations are highly entangled, mixing information about lexicon, syntax, meaning, and reasoning. This entanglement biases conventional brain encoding analyses toward linguistically shallow features (e.g., lexicon and syntax), making it difficult to isolate the neural substrates of cognitively deeper processes. Here, we introduce a residual disentanglement method that computationally isolates these components. By first probing an LM to identify feature-specific layers, our method iteratively regresses out lower-level representations to produce four nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically, reasoning. We used these disentangled embeddings to model intracranial (ECoG) brain recordings from neurosurgical patients listening to natural speech. We show that: 1) This isolated reasoning embedding exhibits unique predictive power, accounting for variance in neural activity not explained by other linguistic features and even extending to the recruitment of visual regions beyond classical language areas.



Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models

Neural Information Processing Systems

A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected probability of selected OOD labels being activated by OOD samples, and ensuring low mutual dependence among the activations of these OOD labels. A natural expansion manner is to adopt a larger lexicon; however, the inevitable introduction of numerous synonyms and uncommon words fails to meet the above requirements, indicating that viable expansion manners move beyond merely selecting words from a lexicon. Since OOD detection aims to correctly classify input images into ID/OOD class groups, we can make up OOD label candidates which are not standard class names but beneficial for the process. Observing that the original semantic pool is comprised of unmodified specific class names, we correspondingly construct a conjugated semantic pool (CSP) consisting of modified superclass names, each serving as a cluster center for samples sharing similar properties across different categories. Consistent with our established theory, expanding OOD label candidates with the CSP satisfies the requirements and outperforms existing works by 7.89% in FPR95.




AttrSeg: Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation

Neural Information Processing Systems

Open-vocabulary semantic segmentation is a challenging task that requires segmenting novel object categories at inference time. Recent works explore vision-language pre-training to handle this task, but suffer from unrealistic assumptions in practical scenarios, i.e., low-quality textual category names.For example, this paradigm assumes that new textual categories will be accurately and completely provided, and exist in lexicons during pre-training.However, exceptions often happen when meet with ambiguity for brief or incomplete names, new words that are not present in the pre-trained lexicons, and difficult-to-describe categories for users.To address these issues, this work proposes a novel framework, AttrSeg


TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages

arXiv.org Artificial Intelligence

Low-resource African languages remain underrepresented in sentiment analysis research, resulting in limited lexical resources and reduced model performance in multilingual applications. This gap restricts equitable access to Natural Language Processing (NLP) technologies and hinders downstream tasks such as public-health monitoring, digital governance, and financial inclusion. To address this challenge, this paper introduces TriLex, a three-stage retrieval-augmented framework that integrates corpus-based extraction, cross-lingual mapping, and Retrieval-Augmented Generation (RAG) driven lexicon refinement for scalable sentiment lexicon expansion in low-resource languages. Using an expanded lexicon, we evaluate two leading African language models (AfroXLMR and AfriBERTa) across multiple case studies. Results show that AfroXLMR consistently achieves the strongest performance, with F1-scores exceeding 80% for isiXhosa and isiZulu, aligning with previously reported ranges (71-75%), and demonstrating high multilingual stability with narrow confidence intervals. AfriBERTa, despite lacking pre-training on the target languages, attains moderate but reliable F1-scores around 64%, confirming its effectiveness under constrained computational settings. Comparative analysis shows that both models outperform traditional machine learning baselines, while ensemble evaluation combining AfroXLMR variants indicates complementary improvements in precision and overall stability. These findings confirm that the TriLex framework, together with AfroXLMR and AfriBERTa, provides a robust and scalable approach for sentiment lexicon development and multilingual sentiment analysis in low-resource South African languages.


NeLLCom-Lex: A Neural-agent Framework to Study the Interplay between Lexical Systems and Language Use

arXiv.org Artificial Intelligence

Lexical semantic change has primarily been investigated with observational and experimental methods; however, observational methods (corpus analysis, distributional semantic modeling) cannot get at causal mechanisms, and experimental paradigms with humans are hard to apply to semantic change due to the extended diachronic processes involved. This work introduces NeLLCom-Lex, a neural-agent framework designed to simulate semantic change by first grounding agents in a real lexical system (e.g. English) and then systematically manipulating their communicative needs. Using a well-established color naming task, we simulate the evolution of a lexical system within a single generation, and study which factors lead agents to: (i) develop human-like naming behavior and lexicons, and (ii) change their behavior and lexicons according to their communicative needs. Our experiments with different supervised and reinforcement learning pipelines show that neural agents trained to 'speak' an existing language can reproduce human-like patterns in color naming to a remarkable extent, supporting the further use of NeLLCom-Lex to elucidate the mechanisms of semantic change.


ChiKhaPo: A Large-Scale Multilingual Benchmark for Evaluating Lexical Comprehension and Generation in Large Language Models

arXiv.org Artificial Intelligence

Existing benchmarks for large language models (LLMs) are largely restricted to high- or mid-resource languages, and often evaluate performance on higher-order tasks in reasoning and generation. However, plenty of evidence points to the fact that LLMs lack basic linguistic competence in the vast majority of the world's 3800+ written languages. We introduce ChiKhaPo, consisting of 8 subtasks of varying difficulty designed to evaluate the lexical comprehension and generation abilities of generative models. ChiKhaPo draws on existing lexicons, monolingual data, and bitext, and provides coverage for 2700+ languages for 2 subtasks, surpassing any existing benchmark in terms of language coverage. We further show that 6 SOTA models struggle on our benchmark, and discuss the factors contributing to performance scores, including language family, language resourcedness, task, and comprehension versus generation directions. With ChiKhaPo, we hope to enable and encourage the massively multilingual benchmarking of LLMs.