Zeng, Qingcheng
Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility
Lam, Suet-Ying, Zeng, Qingcheng, Wu, Jingyi, Voigt, Rob
Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends upon both model size - with larger models more likely to reflect human-like patterns and the choice of meta-linguistic prompts used to elicit the behavior.
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Xuan, Weihao, Yang, Rui, Qi, Heli, Zeng, Qingcheng, Xiao, Yunze, Xing, Yun, Wang, Junjue, Li, Huitao, Li, Xin, Yu, Kunyu, Liu, Nan, Chen, Qingyu, Teodoro, Douglas, Marrese-Taylor, Edison, Lu, Shijian, Iwasawa, Yusuke, Matsuo, Yutaka, Li, Irene
Traditional benchmarks struggle to evaluate increasingly sophisticated language models in multilingual and culturally diverse contexts. To address this gap, we introduce MMLU-ProX, a comprehensive multilingual benchmark covering 13 typologically diverse languages with approximately 11,829 questions per language. Building on the challenging reasoning-focused design of MMLU-Pro, our framework employs a semi-automatic translation process: translations generated by state-of-the-art large language models (LLMs) are rigorously evaluated by expert annotators to ensure conceptual accuracy, terminological consistency, and cultural relevance. We comprehensively evaluate 25 state-of-the-art LLMs using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, analyzing their performance across linguistic and cultural boundaries. Our experiments reveal consistent performance degradation from high-resource languages to lower-resource ones, with the best models achieving over 70% accuracy on English but dropping to around 40% for languages like Swahili, highlighting persistent gaps in multilingual capabilities despite recent advances. MMLU-ProX is an ongoing project; we are expanding our benchmark by incorporating additional languages and evaluating more language models to provide a more comprehensive assessment of multilingual capabilities.
SCORE: Saturated Consensus Relocalization in Semantic Line Maps
Jiang, Haodong, Zheng, Xiang, Zhang, Yanglin, Zeng, Qingcheng, Li, Yiqian, Hong, Ziyang, Wu, Junfeng
This is the arxiv version for our paper submitted to IEEE/RSJ IROS 2025. We propose a scene-agnostic and light-weight visual relocalization framework that leverages semantically labeled 3D lines as a compact map representation. In our framework, the robot localizes itself by capturing a single image, extracting 2D lines, associating them with semantically similar 3D lines in the map, and solving a robust perspective-n-line problem. To address the extremely high outlier ratios~(exceeding 99.5\%) caused by one-to-many ambiguities in semantic matching, we introduce the Saturated Consensus Maximization~(Sat-CM) formulation, which enables accurate pose estimation when the classic Consensus Maximization framework fails. We further propose a fast global solver to the formulated Sat-CM problems, leveraging rigorous interval analysis results to ensure both accuracy and computational efficiency. Additionally, we develop a pipeline for constructing semantic 3D line maps using posed depth images. To validate the effectiveness of our framework, which integrates our innovations in robust estimation and practical engineering insights, we conduct extensive experiments on the ScanNet++ dataset.
ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Huang, Shulin, Yang, Linyi, Song, Yan, Chen, Shuang, Cui, Leyang, Wan, Ziyu, Zeng, Qingcheng, Wen, Ying, Shao, Kun, Zhang, Weinan, Wang, Jun, Zhang, Yue
Evaluating large language models (LLMs) poses significant challenges, particularly due to issues of data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to evaluate LLMs' reasoning capability robustly. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset that contains 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning models and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most of the LLMs' performance are far from robust and they face a certain level of data leakage. By dynamically generating OOD datasets, ThinkBench effectively provides a reliable evaluation of LLMs and reduces the impact of data contamination.
Toward Equitable Access: Leveraging Crowdsourced Reviews to Investigate Public Perceptions of Health Resource Accessibility
Xue, Zhaoqian, Liu, Guanhong, Wei, Kai, Zhang, Chong, Zeng, Qingcheng, Hu, Songhua, Hua, Wenyue, Fan, Lizhou, Zhang, Yongfeng, Li, Lingyao
Access to health resources is a critical determinant of public well-being and societal resilience, particularly during public health crises when demand for medical services and preventive care surges. However, disparities in accessibility persist across demographic and geographic groups, raising concerns about equity. Traditional survey methods often fall short due to limitations in coverage, cost, and timeliness. This study leverages crowdsourced data from Google Maps reviews, applying advanced natural language processing techniques, specifically ModernBERT, to extract insights on public perceptions of health resource accessibility in the United States during the COVID-19 pandemic. Additionally, we employ Partial Least Squares regression to examine the relationship between accessibility perceptions and key socioeconomic and demographic factors including political affiliation, racial composition, and educational attainment. Our findings reveal that public perceptions of health resource accessibility varied significantly across the U.S., with disparities peaking during the pandemic and slightly easing post-crisis. Political affiliation, racial demographics, and education levels emerged as key factors shaping these perceptions. These findings underscore the need for targeted interventions and policy measures to address inequities, fostering a more inclusive healthcare infrastructure that can better withstand future public health challenges.
Sympathy over Polarization: A Computational Discourse Analysis of Social Media Posts about the July 2024 Trump Assassination Attempt
Zeng, Qingcheng, Liu, Guanhong, Xue, Zhaoqian, Ford, Diego, Voigt, Rob, Hagen, Loni, Li, Lingyao
On July 13, 2024, at the Trump rally in Pennsylvania, someone attempted to assassinate Republican Presidential Candidate Donald Trump. This attempt sparked a large-scale discussion on social media. We collected posts from X (formerly known as Twitter) one week before and after the assassination attempt and aimed to model the short-term effects of such a ``shock'' on public opinions and discussion topics. Specifically, our study addresses three key questions: first, we investigate how public sentiment toward Donald Trump shifts over time and across regions (RQ1) and examine whether the assassination attempt itself significantly affects public attitudes, independent of the existing political alignments (RQ2). Finally, we explore the major themes in online conversations before and after the crisis, illustrating how discussion topics evolved in response to this politically charged event (RQ3). By integrating large language model-based sentiment analysis, difference-in-differences modeling, and topic modeling techniques, we find that following the attempt the public response was broadly sympathetic to Trump rather than polarizing, despite baseline ideological and regional disparities.
Causal Micro-Narratives
Heddaya, Mourad, Zeng, Qingcheng, Tan, Chenhao, Voigt, Rob, Zentefis, Alexander
We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model--a fine-tuned Llama 3.1 8B--achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.
Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis
Li, Daoyang, Jin, Mingyu, Zeng, Qingcheng, Zhao, Haiyan, Du, Mengnan
Probing techniques for large language models (LLMs) have primarily focused on English, overlooking the vast majority of the world's languages. In this paper, we extend these probing methods to a multilingual context, investigating the behaviors of LLMs across diverse languages. We conduct experiments on several open-source LLM models, analyzing probing accuracy, trends across layers, and similarities between probing vectors for multiple languages. Our key findings reveal: (1) a consistent performance gap between high-resource and low-resource languages, with high-resource languages achieving significantly higher probing accuracy; (2) divergent layer-wise accuracy trends, where high-resource languages show substantial improvement in deeper layers similar to English; and (3) higher representational similarities among high-resource languages, with low-resource languages demonstrating lower similarities both among themselves and with high-resource languages. These results highlight significant disparities in LLMs' multilingual capabilities and emphasize the need for improved modeling of low-resource languages.
Distributed Invariant Kalman Filter for Object-level Multi-robot Pose SLAM
Li, Haoying, Zeng, Qingcheng, Li, Haoran, Zhang, Yanglin, Wu, Junfeng
Cooperative localization and target tracking are essential for multi-robot systems to implement high-level tasks. To this end, we propose a distributed invariant Kalman filter based on covariance intersection for effective multi-robot pose estimation. The paper utilizes the object-level measurement models, which have condensed information further reducing the communication burden. Besides, by modeling states on special Lie groups, the better linearity and consistency of the invariant Kalman filter structure can be stressed. We also use a combination of CI and KF to avoid overly confident or conservative estimates in multi-robot systems with intricate and unknown correlations, and some level of robot degradation is acceptable through multi-robot collaboration. The simulation and real data experiment validate the practicability and superiority of the proposed algorithm.
Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models
Zeng, Qingcheng, Jin, Mingyu, Yu, Qinkai, Wang, Zhenting, Hua, Wenyue, Zhou, Zihao, Sun, Guangyan, Meng, Yanda, Ma, Shiqing, Wang, Qifan, Juefei-Xu, Felix, Ding, Kaize, Yang, Fan, Tang, Ruixiang, Zhang, Yongfeng
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the fragility of uncertainty estimation and explores potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the probability distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model's self-evaluation reliability in multiple-choice questions. For instance, we achieved a 100 attack success rate (ASR) across three different triggering strategies in four models. Further, we investigate whether this manipulation generalizes across different prompts and domains. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks. The code is available at https://github.com/qcznlp/uncertainty_attack.