Goto

Collaborating Authors

 deduction


SIRIUS : Contexual Sparisty with Correction for Efficient LLMs

Neural Information Processing Systems

With the blossom of large language models (LLM), inference efficiency becomes increasingly important. Various approximate methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without significant performance degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, it significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observed that sparse models and original models often share the general problem-solving logic and require only a few token corrections to recover the original model performance. This paper introduces SIRIUS, an efficient correction mechanism, which significantly boosts CS models on reasoning tasks while maintaining its efficiency gain. SIRIUS is evaluated on 6 models with 8 difficult generation tasks in reasoning, deduction, and coding and shows consistent effectiveness and efficiency. Also, we carefully develop a system implementation for SIRIUS and show that SIRIUS delivers theoretical latency reduction with roughly a 20% reduction in latency for 8B model on-chip and a 35% reduction in latency for 70B model offloading.


LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Byun, Grace, Rajwal, Swati, Choi, Jinho D.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55\% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.


VoiceAgentEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Voice-Agent Evaluation of Xbench's Professional-Aligned Series

Xu, Pengyu, Li, Shijia, Sun, Ao, Zhang, Feng, Li, Yahan, Wu, Bo, Ma, Zhanyu, Li, Jiguo, Xu, Jun, Gao, Jiuchong, Hao, Jinghua, He, Renqing, Wang, Rui, Liu, Yang, Hu, Xiaobo, Yang, Fan, Zheng, Jia, Yao, Guanghua

arXiv.org Artificial Intelligence

We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations - insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics - OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.


Modeling Hierarchical Thinking in Large Reasoning Models

Shahariar, G M, Nazari, Ali, Shayegani, Erfan, Abu-Ghazaleh, Nael

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated remarkable reasoning abilities when they generate step-by-step solutions, known as chain-of-thought (CoT) reasoning. When trained to using chain-of-thought reasoning examples, the resulting models (called Large Reasoning Models, or LRMs) appear to learn hierarchical thinking strategies similar to those used by humans. However, understanding LRMs emerging reasoning capabilities remains a difficult open problem, with many potential important applications including improving training and understanding robustness. In this paper, we adopt a memoryless Finite State Machine formulation to approximate LRM's emerging hierarchical reasoning dynamics as a structured, interpretable abstraction. We identify a small set of discrete reasoning states including - initialization, deduction, augmentation-strategy, uncertainty-estimation, backtracking, and final-conclusion that capture the high-level states present in the model's reasoning process. By annotating each step of a model's CoT with these states, we can represent the reasoning trajectory as a transition sequence through the state graph. This FSM formulation provides a systematic way to analyze, interpret and visualize how different models approach problems. We describe the FSM model, provide examples of CoT annotations under this scheme, and discuss how it can shed light on differences between available models in their approach to reasoning. Our results demonstrate that this FSM-based analysis reveals distinct reasoning patterns and potential shortcomings, offering a new lens to evaluate and improve LLM reasoning.


Understanding Reasoning in Thinking Language Models via Steering Vectors

Venhoff, Constantin, Arcuschin, Iván, Torr, Philip, Conmy, Arthur, Nanda, Neel

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.


Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

Zhu, Chenghao, Tao, Meiling, Wang, Tiannan, Ding, Dongyi, Jiang, Yuchen Eleanor, Zhou, Wangchunshu

arXiv.org Artificial Intelligence

Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11\% win-rate improvement, and personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.


Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Ma, Wenjie, Cojocaru, Andrei, Kolhe, Neel, Louie, Bradley, Sharif, Robin Said, Zhang, Haihan, Zhuang, Vincent, Zaharia, Matei, Min, Sewon

arXiv.org Artificial Intelligence

Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. %with expert gradings. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-$n$ selection task: at $n=16$, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.


Extended Triangular Method: A Generalized Algorithm for Contradiction Separation Based Automated Deduction

Xu, Yang, Chen, Shuwei, Liu, Jun, Cao, Feng, He, Xingxing

arXiv.org Artificial Intelligence

Automated deduction lies at the core of Artificial Intelligence (AI), underpinning theorem proving, formal verification, and logical reasoning. Despite decades of progress, reconciling deductive completeness with computational efficiency remains an enduring challenge. Traditional reasoning calculi, grounded in binary resolution, restrict inference to pairwise clause interactions and thereby limit deductive synergy among multiple clauses. The Contradiction Separation Extension (CSE) framework, introduced in 2018, proposed a dynamic multi-clause reasoning theory that redefined logical inference as a process of contradiction separation rather than sequential resolution. While that work established the theoretical foundation, its algorithmic realization remained unformalized and unpublished. This work presents the Extended Triangular Method (ETM), a generalized contradiction-construction algorithm that formalizes and extends the internal mechanisms of contradiction separation. The ETM unifies multiple contradiction-building strategies, including the earlier Standard Extension method, within a triangular geometric framework that supports flexible clause interaction and dynamic synergy. ETM serves as the algorithmic core of several high-performance theorem provers, CSE, CSE-E, CSI-E, and CSI-Enig, whose competitive results in standard first-order benchmarks (TPTP problem sets and CASC 2018-2015) empirically validate the effectiveness and generality of the proposed approach. By bridging theoretical abstraction and operational implementation, ETM advances the contradiction separation paradigm into a generalized, scalable, and practically competitive model for automated reasoning, offering new directions for future research in logical inference and theorem proving.


RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

Mahdavi, Hamed, Mahdavinia, Pouria, Malek, Samira, Mohammadipour, Pegah, Hashemi, Alireza, Daliri, Majid, Farhadi, Alireza, Khasahmadi, Amir, Mireshghallah, Niloofar, Honavar, Vasant

arXiv.org Artificial Intelligence

State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.


Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures

Xie, Wen-Bo, Fu, Xun, Chen, Bin, Lee, Yan-Li, Deng, Tao, Zou, Tian, Wang, Xin, Liu, Zhen, Srivastavad, Jaideep

arXiv.org Artificial Intelligence

In this work, we focus on the efficiency and scalability of pairwise constraint-based active clustering, crucial for processing large-scale data in applications such as data mining, knowledge annotation, and AI model pre-training. Our goals are threefold: (1) to reduce computational costs for iterative clustering updates; (2) to enhance the impact of user-provided constraints to minimize annotation requirements for precise clustering; and (3) to cut down memory usage in practical deployments. To achieve these aims, we propose a graph-based active clustering algorithm that utilizes two sparse graphs: one for representing relationships between data (our proposed data skeleton) and another for updating this data skeleton. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently facilitates more accurate clustering with dramatically less input of user-provided constraints, and outperforms its counterparts in terms of computational performance and scalability, while maintaining robustness across various distance metrics.