
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models

Wang, Keyu, Li, Jin, Yang, Shu, Zhang, Zhuoran, Wang, Di

arXiv.org Artificial Intelligence

Large Language Models (LLMs) often exhibit sycophantic behavior, agreeing with user-stated opinions even when those contradict factual knowledge. While prior work has documented this tendency, the internal mechanisms that enable such behavior remain poorly understood. In this paper, we provide a mechanistic account of how sycophancy arises within LLMs. We first systematically study how user opinions induce sycophancy across different model families. We find that simple opinion statements reliably induce sycophancy, whereas user expertise framing has a negligible impact. Through logit-lens analysis and causal activation patching, we identify a two-stage emergence of sycophancy: (1) a late-layer output preference shift and (2) deeper representational divergence. We also verify that user authority fails to influence behavior because models do not encode it internally. In addition, we examine how grammatical perspective affects sycophantic behavior, finding that first-person prompts (``I believe...'') consistently induce higher sycophancy rates than third-person framings (``They believe...'') by creating stronger representational perturbations in deeper layers. These findings highlight that sycophancy is not a surface-level artifact but emerges from a structural override of learned knowledge in deeper layers, with implications for alignment and truthful AI systems.
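The abstract mentions logit-lens analysis as one of its probes. As a hedged illustration only (the paper's models, prompts, and patching setup are not given here), the sketch below shows the generic logit-lens technique of projecting each layer's hidden state through the unembedding matrix, using GPT-2 purely as a stand-in:

```python
# Minimal logit-lens sketch: project each layer's hidden state through the
# unembedding matrix to see which next token the model "prefers" at that depth.
# GPT-2 and this prompt are stand-ins, not the paper's actual models or data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "I believe the capital of Australia is Sydney. The capital of Australia is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight   # [vocab, hidden]
ln_f = model.transformer.ln_f                    # final layer norm

# hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
for layer, h in enumerate(out.hidden_states):
    last = ln_f(h[0, -1])                        # last-token state at this layer
    logits = last @ unembed.T                    # project into vocabulary space
    top = tok.decode([int(logits.argmax())])
    print(f"layer {layer:2d}: top next-token prediction = {top!r}")
```

Tracking where the top prediction flips between the factual and the user-stated answer is one simple way to locate the kind of late-layer preference shift the paper describes.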


oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Xu, Ruiling, Zhang, Yifan, Wang, Qingyun, Edwards, Carl, Ji, Heng

arXiv.org Artificial Intelligence

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic reaction mechanism reasoning. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that combining our prompting strategy with fine-tuning a specialist model on the proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
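The abstract does not give oMeS's actual formula. As a hedged sketch only, one plausible step-level score combines an exact step-type match with molecular similarity between predicted and reference intermediates; the weights, function names, and fingerprint choice below are illustrative assumptions:

```python
# Illustrative step-level scorer (not the actual oMeS): mixes step-type
# correctness with Tanimoto similarity of predicted vs. reference intermediates.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def intermediate_similarity(smiles_pred: str, smiles_ref: str) -> float:
    """Tanimoto similarity of Morgan fingerprints; 0.0 if either SMILES is invalid."""
    mol_pred = Chem.MolFromSmiles(smiles_pred)
    mol_ref = Chem.MolFromSmiles(smiles_ref)
    if mol_pred is None or mol_ref is None:
        return 0.0
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(mol_pred, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(mol_ref, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

def step_score(pred_type: str, ref_type: str,
               pred_smiles: str, ref_smiles: str,
               w_type: float = 0.5, w_sim: float = 0.5) -> float:
    """Weighted mix of step-type correctness and intermediate similarity (weights assumed)."""
    type_match = 1.0 if pred_type == ref_type else 0.0
    return w_type * type_match + w_sim * intermediate_similarity(pred_smiles, ref_smiles)

# Example: a proton-transfer step whose predicted intermediate is close but not exact.
print(step_score("proton_transfer", "proton_transfer", "CC(=O)[O-]", "CC(=O)O"))
```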


Dissecting Transformers: A CLEAR Perspective towards Green AI

Jain, Hemang, Goyal, Shailender, Pandey, Divyansh, Vaidhyanathan, Karthik

arXiv.org Artificial Intelligence

The rapid adoption of Large Language Models (LLMs) has raised significant environmental concerns. Unlike the one-time cost of training, LLM inference occurs continuously at a global scale and now dominates the AI energy footprint. Yet, most sustainability studies report only coarse, model-level metrics due to the lack of fine-grained measurement methods, treating energy efficiency more as an afterthought than as a primary objective. We present the first fine-grained empirical analysis of inference energy across core components of the transformer architecture. We propose a novel methodology, Component-Level Energy Assessment via Repeated sampling (CLEAR), to overcome the temporal mismatch between microsecond-scale component execution and millisecond-scale energy sensors. Using CLEAR, we evaluate 15 models spanning four distinct architecture types, consistently keeping component-wise energy variance below 9.5% while attributing more than 90% of each model's total energy to individual components. Our empirical analysis reveals that attention blocks consume significantly more energy per floating-point operation (FLOP), indicating that energy consumption is not proportionally aligned with FLOP counts. This shows that FLOPs alone fail to capture the true energy cost at a component level. Our findings establish detailed component-level energy baselines and provide insight as an initial step toward building energy-efficient transformer models through component-level optimizations.
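The abstract only names the repeated-sampling idea, so the following is a hedged sketch of it under assumed conditions: a microsecond-scale component is looped enough times that its total runtime exceeds the resolution of a coarse energy counter, and the per-call energy is the measured delta divided by the repeat count. The Linux RAPL sysfs counter path and the amortization scheme are assumptions, not CLEAR's implementation:

```python
# Repeated-sampling energy sketch (illustrative): amortize a coarse energy
# counter over many invocations of a fast component.
# The RAPL sysfs path below is an assumption (Linux, Intel package domain).
import time

RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # assumed energy counter

def read_energy_uj() -> int:
    with open(RAPL_PATH) as f:
        return int(f.read().strip())

def energy_per_call(component_fn, repeats: int = 10_000) -> float:
    """Return mean energy per invocation in microjoules."""
    start_e, start_t = read_energy_uj(), time.perf_counter()
    for _ in range(repeats):
        component_fn()
    end_e, end_t = read_energy_uj(), time.perf_counter()
    delta = end_e - start_e   # real code must also handle counter wraparound
    print(f"{repeats} calls in {end_t - start_t:.3f}s, {delta} uJ total")
    return delta / repeats

if __name__ == "__main__":
    # Example: measure a toy matrix-multiply standing in for one transformer component.
    import numpy as np
    q = np.random.rand(64, 64)
    energy_per_call(lambda: q @ q.T)
```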


FetalSleepNet: A Transfer Learning Framework with Spectral Equalisation Domain Adaptation for Fetal Sleep Stage Classification

Tang, Weitao, Vargas-Calixto, Johann, Katebi, Nasim, Tran, Nhi, Kelly, Sharmony B., Clifford, Gari D., Galinsky, Robert, Marzbanrad, Faezeh

arXiv.org Artificial Intelligence

Introduction: This study presents FetalSleepNet, the first published deep learning approach to classifying sleep states from the ovine electroencephalogram (EEG). Fetal EEG is complex to acquire and difficult and laborious to interpret consistently. However, accurate sleep stage classification may aid in the early detection of abnormal brain maturation associated with pregnancy complications (e.g., ...). Methods: EEG electrodes were secured onto the ovine dura over the parietal cortices of 24 late-gestation fetal sheep. A lightweight deep neural network originally developed for adult EEG sleep staging was trained on the ovine EEG using transfer learning from adult EEG. A spectral equalisation-based domain adaptation strategy was used to reduce cross-domain mismatch. Results: We demonstrated that while direct transfer performed poorly, full fine-tuning combined with spectral equalisation achieved the best overall performance (accuracy: 86.6%, macro F1-score: 62.5), outperforming baseline models. Conclusions: To the best of our knowledge, FetalSleepNet is the first deep learning framework specifically developed for automated sleep staging from the fetal EEG. Beyond the laboratory, the EEG-based sleep stage classifier functions as a label engine, enabling large-scale weak/semi-supervised labeling and distillation to facilitate training on less invasive signals that can be acquired in the clinic, such as Doppler ultrasound or electrocardiogram data. FetalSleepNet's lightweight design makes it well suited for deployment in low-power, real-time, and wearable fetal monitoring systems. Sleep state patterns reflect fetal neurophysiological function and development [1], and are clinically relevant for detecting abnormal neurodevelopment, which may result from conditions such as chronic hypoxia, infection, or hypertensive disorders of pregnancy (HDP) [2]-[4].
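The abstract does not describe the spectral equalisation step in detail. A hedged sketch of one common form is shown below: estimate source- and target-domain power spectral densities with Welch's method and apply a per-frequency gain so the target signal's spectrum roughly matches the reference domain the network was pre-trained on. The window length and smoothing choices are illustrative, not the paper's:

```python
# Sketch of spectral equalisation for domain adaptation (illustrative only):
# reshape a target-domain signal's per-frequency power toward a reference spectrum.
import numpy as np
from scipy.signal import welch

def spectral_equalise(x_target: np.ndarray, x_ref: np.ndarray, fs: float) -> np.ndarray:
    """Apply a per-frequency gain so x_target's PSD roughly matches x_ref's."""
    f, psd_t = welch(x_target, fs=fs, nperseg=1024)
    _, psd_r = welch(x_ref, fs=fs, nperseg=1024)
    gain = np.sqrt(psd_r / np.maximum(psd_t, 1e-12))   # amplitude gain per bin

    # Filter in the frequency domain: interpolate the gain onto the rFFT grid.
    spec = np.fft.rfft(x_target)
    freqs = np.fft.rfftfreq(len(x_target), d=1.0 / fs)
    spec *= np.interp(freqs, f, gain)
    return np.fft.irfft(spec, n=len(x_target))

# Synthetic signals stand in for fetal (target) and adult (reference) EEG.
fs = 256.0
n = int(60 * fs)
fetal_like = 0.5 * np.random.randn(n)
adult_like = np.random.randn(n)
equalised = spectral_equalise(fetal_like, adult_like, fs)
```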


QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges

Basit, Abdul, Shao, Minghao, Asif, Muhammad Haider, Innan, Nouhaila, Kashif, Muhammad, Marchisio, Alberto, Shafique, Muhammad

arXiv.org Artificial Intelligence

Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, perform roughly on par with standard prompting, particularly on complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
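In the spirit of the iterative-refinement pipeline the abstract describes, here is a minimal generate-execute-refine sketch. The `ask_llm` helper is a hypothetical stand-in for whatever model client is used, and the execution check is a generic subprocess run, not QHackBench's actual harness:

```python
# Generate-execute-refine loop sketch (illustrative, not the paper's pipeline).
import subprocess
import sys
import tempfile

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")  # assumed helper

def run_candidate(code: str) -> tuple[bool, str]:
    """Execute candidate PennyLane code in a subprocess; return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True,
                          text=True, timeout=120)
    return proc.returncode == 0, proc.stderr

def solve(task: str, max_rounds: int = 3) -> str:
    """Generate code, run it, and feed errors back to the model for repair."""
    code = ask_llm(f"Write PennyLane code for this task:\n{task}")
    for _ in range(max_rounds):
        ok, err = run_candidate(code)
        if ok:
            return code
        code = ask_llm(f"The code failed with:\n{err}\nFix it:\n{code}")
    return code
```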


OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software

Meng, Lingkai, Shao, Yu, Yuan, Long, Lai, Longbin, Cheng, Peng, Yu, Wenyuan, Zhang, Wenjie, Lin, Xuemin, Zhou, Jingren

arXiv.org Artificial Intelligence

Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software's usability. Additionally, our demonstration showcases OSS-UAgent's practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.
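As a hedged illustration of the persona-conditioned agents and multi-dimension scoring the abstract describes, the sketch below builds an experience-level prompt from retrieved documentation snippets and aggregates per-dimension scores. The persona texts, snippet format, and scoring dimensions are assumptions, not OSS-UAgent's implementation:

```python
# Persona-conditioned prompting and score aggregation sketch (illustrative only).
PERSONAS = {
    "junior": "You are a junior developer with limited experience in this platform.",
    "intermediate": "You are a developer with a few years of general experience.",
    "expert": "You are an expert who knows this platform's APIs in depth.",
}

def build_prompt(level: str, task: str, knowledge_snippets: list[str]) -> str:
    """Combine a persona, retrieved platform knowledge, and the programming task."""
    docs = "\n".join(f"- {s}" for s in knowledge_snippets)
    return (f"{PERSONAS[level]}\n"
            f"Relevant platform documentation:\n{docs}\n"
            f"Task: {task}\nWrite the code.")

def usability_score(scores: dict[str, float]) -> float:
    """Average per-dimension scores (here: compliance, correctness, readability)."""
    return sum(scores.values()) / len(scores)

prompt = build_prompt("junior", "Compute connected components of the provided graph.",
                      ["a snippet retrieved from the platform's API docs"])
print(usability_score({"compliance": 0.8, "correctness": 1.0, "readability": 0.7}))
```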


Recursive Decomposition of Logical Thoughts: Framework for Superior Reasoning and Knowledge Propagation in Large Language Models

Qasim, Kaleem Ullah, Zhang, Jiashu, Alsahfi, Tariq, Butt, Ateeq Ur Rehman

arXiv.org Artificial Intelligence

Enhancing the reasoning capabilities of Large Language Models remains a critical challenge in artificial intelligence. We introduce RDoLT, Recursive Decomposition of Logical Thought prompting, a novel framework that significantly boosts LLM reasoning performance. RDoLT is built on three key innovations: (1) recursively breaking down complex reasoning tasks into sub-tasks of progressive complexity; (2) employing an advanced selection and scoring mechanism to identify the most promising reasoning thoughts; and (3) integrating a knowledge propagation module that mimics human learning by keeping track of strong and weak thoughts for information propagation. Our approach was evaluated across multiple benchmarks, including GSM8K, SVAMP, MultiArith, LastLetterConcatenation, and Gaokao2023 Math. The results demonstrate that RDoLT consistently outperforms existing state-of-the-art techniques, achieving a 90.98 percent accuracy on GSM8K with ChatGPT-4, surpassing state-of-the-art techniques by 6.28 percent. Similar improvements were observed on other benchmarks, with accuracy gains ranging from 5.5 percent to 6.75 percent. These findings highlight RDoLT's potential to advance prompt engineering, offering a more effective and generalizable approach to complex reasoning tasks.
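The following is a loose, hedged skeleton of the three components the abstract lists: recursive decomposition, thought scoring, and a strong/weak thought pool. The `llm` and `score_thought` helpers and the 0.5 threshold are stand-in assumptions, not RDoLT's actual procedure:

```python
# Recursive decomposition with thought scoring and knowledge propagation (sketch).
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model client")            # assumed helper

def score_thought(thought: str) -> float:
    raise NotImplementedError("e.g., a model-based quality score")  # assumed helper

def solve(task: str, depth: int = 0, max_depth: int = 2,
          strong: list[str] | None = None, weak: list[str] | None = None) -> str:
    strong, weak = strong if strong is not None else [], weak if weak is not None else []
    if depth == max_depth:
        context = "Useful prior thoughts:\n" + "\n".join(strong)
        return llm(f"{context}\nSolve directly: {task}")

    # (1) Decompose into simpler subtasks, (2) score each resulting thought,
    # (3) propagate strong and weak thoughts to later steps.
    subtasks = llm(f"Break this into simpler subtasks, one per line: {task}").splitlines()
    for sub in subtasks:
        thought = solve(sub.strip(), depth + 1, max_depth, strong, weak)
        (strong if score_thought(thought) >= 0.5 else weak).append(thought)

    context = "Strong intermediate results:\n" + "\n".join(strong)
    return llm(f"{context}\nCombine these to answer: {task}")
```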


LalaEval: A Holistic Human Evaluation Framework for Domain-Specific Large Language Models

Sun, Chongyan, Lin, Ken, Wang, Shiwei, Wu, Hulong, Fu, Chengfei, Wang, Zhen

arXiv.org Artificial Intelligence

This paper introduces LalaEval, a holistic framework designed for the human evaluation of domain-specific large language models (LLMs). LalaEval proposes a comprehensive suite of end-to-end protocols covering five main components: domain specification, criteria establishment, benchmark dataset creation, construction of evaluation rubrics, and thorough analysis and interpretation of evaluation outcomes. This initiative aims to fill a crucial research gap by providing a systematic methodology for conducting standardized human evaluations within specific domains. Although human evaluation is widely applied, it receives little substantial coverage in the literature and is often criticized as unreliable due to subjective factors, so standardized procedures adapted to the nuanced requirements of specific domains, or even individual organizations, are in great need. Furthermore, the paper demonstrates the framework's application within the logistics industry, presenting domain-specific evaluation benchmarks, datasets, and a comparative analysis of LLMs for logistics-domain use, highlighting the framework's capacity to elucidate performance differences and guide model selection and development for domain-specific LLMs. Through real-world deployment, the paper underscores the framework's effectiveness in advancing the field of domain-specific LLM evaluation, thereby contributing significantly to the ongoing discussion on LLMs' practical utility and performance in domain-specific applications.
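As a small hedged illustration of the "construction of evaluation rubrics" component, the sketch below encodes a rubric as a data structure and aggregates annotator ratings into a normalised score. The dimensions, scales, and weights are invented for illustration and are not LalaEval's actual rubric:

```python
# Illustrative rubric definition and weighted aggregation for human evaluation.
RUBRIC = {
    "factual_accuracy":   {"weight": 0.4, "scale": (1, 5)},
    "domain_terminology": {"weight": 0.3, "scale": (1, 5)},
    "completeness":       {"weight": 0.3, "scale": (1, 5)},
}

def aggregate(ratings: dict[str, list[int]]) -> float:
    """Weighted mean of per-dimension annotator averages, normalised to [0, 1]."""
    total = 0.0
    for dim, spec in RUBRIC.items():
        lo, hi = spec["scale"]
        mean = sum(ratings[dim]) / len(ratings[dim])
        total += spec["weight"] * (mean - lo) / (hi - lo)
    return total

# Three annotators rate one model answer on each dimension.
print(aggregate({"factual_accuracy": [4, 5, 4],
                 "domain_terminology": [3, 4, 4],
                 "completeness": [5, 4, 5]}))
```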


Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning

Matthews, Michael, Beukman, Michael, Ellis, Benjamin, Samvelyan, Mikayel, Jackson, Matthew, Coward, Samuel, Foerster, Jakob

arXiv.org Artificial Intelligence

Benchmarks play a crucial role in the development and analysis of reinforcement learning (RL) algorithms. We identify that existing benchmarks used for research into open-ended learning fall into one of two categories. Either they are too slow for meaningful research to be performed without enormous computational resources, like Crafter, NetHack and Minecraft, or they are not complex enough to pose a significant challenge, like Minigrid and Procgen. To remedy this, we first present Craftax-Classic: a ground-up rewrite of Crafter in JAX that runs up to 250x faster than the Python-native original. A run of PPO using 1 billion environment interactions finishes in under an hour using only a single GPU and averages 90% of the optimal reward. To provide a more compelling challenge we present the main Craftax benchmark, a significant extension of the Crafter mechanics with elements inspired by NetHack. Solving Craftax requires deep exploration, long-term planning and memory, as well as continual adaptation to novel situations as more of the world is discovered. We show that existing methods, including global and episodic exploration as well as unsupervised environment design, fail to make material progress on the benchmark. We believe that Craftax can for the first time allow researchers to experiment in a complex, open-ended environment with limited computational resources.
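To illustrate why a JAX rewrite yields such speedups, here is a generic sketch of jit-compiling and vmapping a toy environment step across thousands of parallel instances; it is not Craftax's actual environment code, and the dynamics are invented:

```python
# Toy example of the pattern that makes JAX environments fast: the step function
# is jit-compiled and vmapped over many environments, so one fused kernel
# advances thousands of instances per call on a single accelerator.
import jax
import jax.numpy as jnp

def step(state: jnp.ndarray, action: jnp.ndarray):
    """Toy dynamics: state is a 1-D position; reward is closeness to the origin."""
    new_state = state + jnp.where(action == 1, 0.1, -0.1)
    reward = -jnp.abs(new_state)
    return new_state, reward

batched_step = jax.jit(jax.vmap(step))   # vectorize over environments, then compile

n_envs = 4096
key = jax.random.PRNGKey(0)
states = jnp.zeros(n_envs)
for _ in range(100):
    key, sub = jax.random.split(key)
    actions = jax.random.bernoulli(sub, 0.5, (n_envs,)).astype(jnp.int32)
    states, rewards = batched_step(states, actions)
print(float(rewards.mean()))
```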


Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Wang, Junqi, Zhang, Chunhui, Li, Jiapeng, Ma, Yuxi, Niu, Lixing, Han, Jiaheng, Peng, Yujia, Zhu, Yixin, Fan, Lifeng

arXiv.org Artificial Intelligence

Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for social dynamics and introduced two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Our approach also encompassed a computational model based on recursive Bayesian inference, adept at elucidating diverse human behavioral patterns. Extensive experiments and detailed analyses revealed that humans surpassed the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multiple modalities. Notably, GPT models demonstrated social intelligence only at the most basic order (order = 0), in stark contrast to human social intelligence (order >= 2). Further examination indicated a propensity of LLMs to rely on pattern recognition for shortcuts, casting doubt on their possession of authentic human-level social intelligence. Our code, dataset, appendix, and human data are released at https://github.com/bigai-ai/Evaluate-n-Model-Social-Intelligence.
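As a hedged, order-0 illustration of Bayesian inference over another agent's goals (the building block that recursive models stack), the toy sketch below updates a belief over candidate goals from observed moves on a line. The goals, action model, and observations are invented for illustration and are not the paper's tasks or model:

```python
# Toy Bayesian goal inference: observe an agent's moves and update a belief
# over which goal it is pursuing (illustrative level-0 example only).
import numpy as np

goals = np.array([-2.0, 0.0, 2.0])            # candidate goal positions (assumed)
belief = np.ones(len(goals)) / len(goals)     # uniform prior over goals

def likelihood(action: float, position: float, goal: float, beta: float = 2.0) -> float:
    """Softmax-rational agent: actions that reduce distance to the goal are more likely."""
    gain = abs(position - goal) - abs(position + action - goal)
    return float(np.exp(beta * gain))

position = 0.0
for action in [0.5, 0.5, 0.5]:                # observed moves to the right
    lik = np.array([likelihood(action, position, g) for g in goals])
    belief = belief * lik                      # Bayes update: prior x likelihood
    belief /= belief.sum()
    position += action

print(dict(zip(goals.tolist(), belief.round(3).tolist())))  # mass shifts toward goal +2
```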