question difficulty
Think Smart, Not Hard: Difficulty Adaptive Reasoning for Large Audio Language Models
Sheng, Zhichao, Zhou, Shilin, Gong, Chen, Li, Zhenghua
Large Audio Language Models (LALMs), powered by the chain-of-thought (CoT) paradigm, have shown remarkable reasoning capabilities. Intuitively, different problems often require varying depths of reasoning. While some methods can determine whether to reason for a given problem, they typically lack a fine-grained mechanism to modulate how much to reason. This often results in a ``one-size-fits-all'' reasoning depth, which generates redundant overthinking for simple questions while failing to allocate sufficient thought to complex ones. In this paper, we conduct an in-depth analysis of LALMs and find that an effective and efficient LALM should reason smartly by adapting its reasoning depth to the problem's complexity. To achieve this, we propose a difficulty-adaptive reasoning method for LALMs. Specifically, we propose a reward function that dynamically links reasoning length to the model's perceived problem difficulty. This reward encourages shorter, concise reasoning for easy tasks and more elaborate, in-depth reasoning for complex ones. Extensive experiments demonstrate that our method is both effective and efficient, simultaneously improving task performance and significantly reducing the average reasoning length. Further analysis on reasoning structure paradigm offers valuable insights for future work.
- Europe > Austria > Vienna (0.14)
- Asia > Singapore (0.04)
- North America > United States (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation
Carmel, David, Filice, Simone, Horowitz, Guy, Maarek, Yoelle, Shtoff, Alex, Somekh, Oren, Tavory, Ran
With Retrieval Augmented Generation (RAG) becoming more and more prominent in generative AI solutions, there is an emerging need for systematically evaluating their effectiveness. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the benchmark's questions diversity, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
- North America > United States > Washington > King County > Seattle (0.04)
- North America > Canada (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Sun, Shuang, Song, Huatong, Wang, Yuhao, Ren, Ruiyang, Jiang, Jinhao, Zhang, Junjie, Bai, Fei, Deng, Jia, Zhao, Wayne Xin, Liu, Zheng, Fang, Lei, Wang, Zhongyuan, Wen, Ji-Rong
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations that lack high-quality training trajectories or suffer from the distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of input and output side. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
- Information Technology > Information Management (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
Uncertainty-Aware Knowledge Tracing Models
Mitton, Joshua, Bhattacharyya, Prarthana, Abboud, Ralph, Woodhead, Simon
The main focus of research on Knowledge Tracing (KT) models is on model developments with the aim of improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, leading to student errors going undetected. We present an approach to add new capabilities to KT models by capturing predictive uncertainty and demonstrate that a larger predictive uncertainty aligns with model incorrect predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful for application in an educational learning platform that can be used in a limited resource setting where understanding student ability is necessary.
- North America > United States > New York > New York County > New York City (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Zhu, Yubo, Liu, Dongrui, Lin, Zecheng, Tong, Wei, Zhong, Sheng, Shao, Jing
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
- Asia > China > Shanghai > Shanghai (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)
Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Liu, Yexiang, Li, Zekun, Fang, Zhi, Xu, Nan, He, Ran, Tan, Tieniu
Recently, scaling test-time compute on Large Language Models (LLM) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as scaling. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experiment results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can promote to re-examine the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
- North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Asia > China > Beijing > Beijing (0.04)
Sense and Sensibility: What makes a social robot convincing to high-school students?
Gonzalez-Oliveras, Pablo, Engwall, Olov, Majlesi, Ali Reza
Sense and Sensibility: What makes a social robot convincing to high-school students? Abstract --This study with 40 high-school students demonstrates the high influence of a social educational robot on students' decision-making for a set of eight true-false questions on electric circuits, for which the theory had been covered in the students' courses. The robot argued for the correct answer on six questions and the wrong on two, and 75% of the students were persuaded by the robot to perform beyond their expected capacity, positively when the robot was correct and negatively when it was wrong. Students with more experience of using large language models were even more likely to be influenced by the robot's stance - in particular for the two easiest questions on which the robot was wrong - suggesting that familiarity with AI can increase susceptibility to misinformation by AI. We further examined how three different levels of portrayed robot certainty, displayed using semantics, prosody and facial signals, affected how the students aligned with the robot's answer on specific questions and how convincing they perceived the robot to be on these questions. The students aligned with the robot's answers in 94.4% of the cases when the robot was portrayed as Certain, 82.6% when it was Neutral and 71.4% when it was Uncertain. The alignment was thus high for all conditions, highlighting students' general susceptibility to accept the robot's stance, but alignment in the Uncertain condition was significantly lower than in the Certain. Post-test questionnaire answers further show that students found the robot most convincing when it was portrayed as Certain. These findings highlight the need for educational robots to adjust their display of certainty based on the reliability of the information they convey, to promote students' critical thinking and reduce undue influence. Educational robots are becoming more common and they have significant potential in, e.g., STEM (science, technology, engineering and mathematics) education [46, 69, 17], offering students realistic and natural interactions, not the least by employing Large Language Models (LLMs), as demonstrated in several recent studies [41, 68, 67]. However, it is also well-known that while the LLMs' linguistic proficiency is often astonishing, their factual "knowledge" in STEM subjects is flawed, and incorrect statements occur frequently [34, 60]. Since robots can exert high informational social influence [38, 24, 25, 55, 56] and students will align with the robot's views to large extents [27], the positive as well as negative effects of learning with a social robot need to be considered: Students need to use critical thinking to decide if they should accept the robot's propositions [63]. Educators need to understand which students are more at risk of being misled by a robot presenting incorrect STEM facts, to provide in-time support.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > California > Alameda County > Berkeley (0.14)
- Europe > Sweden > Stockholm > Stockholm (0.04)
- (7 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.93)
- Education > Educational Setting > K-12 Education > Secondary School (1.00)
- Education > Curriculum > Subject-Specific Education (1.00)
Advancing Question Generation with Joint Narrative and Difficulty Control
Leite, Bernardo, Cardoso, Henrique Lopes
Question Generation (QG), the task of automatically generating questions from a source input, has seen significant progress in recent years. Difficulty-controllable QG (DCQG) enables control over the difficulty level of generated questions while considering the learner's ability. Additionally, narrative-controllable QG (NCQG) allows control over the narrative aspects embedded in the questions. However, research in QG lacks a focus on combining these two types of control, which is important for generating questions tailored to educational purposes. To address this gap, we propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances. Our findings highlight the conditions under which the strategy performs well and discuss the trade-offs associated with its application.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Switzerland (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- (11 more...)
AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs
Ding, Xuanwen, Pan, Chengjun, Li, Zejun, Zhang, Jiwen, Wang, Siyuan, Wei, Zhongyu
Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To tackle with this difficulty, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs that tackles this escalating cost. AutoJudger employs the Item Response Theory (IRT) to estimate the question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses, i.e. AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy with the full benchmark evaluation on MMT-Bench.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Mississippi (0.04)
- Asia > China > Shanghai > Shanghai (0.04)
- (3 more...)
- Information Technology (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge
Yaacoub, Antoun, Assaghir, Zainab, Prevost, Lionel, Da-Rugna, Jérôme
Artificial Intelligence (AI)-generated feedback in educational settings has garnered considerable attention due to its potential to enhance learning outcomes. However, a comprehensive understanding of the linguistic characteristics of AI-generated feedback, including readability, lexical richness, and adaptability across varying challenge levels, remains limited. This study delves into the linguistic and structural attributes of feedback generated by Google's Gemini 1.5-flash text model for computer science multiple-choice questions (MCQs). A dataset of over 1,200 MCQs was analyzed, considering three difficulty levels (easy, medium, hard) and three feedback tones (supportive, neutral, challenging). Key linguistic metrics, such as length, readability scores (Flesch-Kincaid Grade Level), vocabulary richness, and lexical density, were computed and examined. A fine-tuned RoBERTa-based multi-task learning (MTL) model was trained to predict these linguistic properties, achieving a Mean Absolute Error (MAE) of 2.0 for readability and 0.03 for vocabulary richness. The findings reveal significant interaction effects between feedback tone and question difficulty, demonstrating the dynamic adaptation of AI-generated feedback within diverse educational contexts. These insights contribute to the development of more personalized and effective AI-driven feedback mechanisms, highlighting the potential for improved learning outcomes while underscoring the importance of ethical considerations in their design and deployment.
- Europe > France > Île-de-France > Paris > Paris (0.05)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Spain > Valencian Community > Valencia Province > Valencia (0.04)
- Asia > Middle East > Lebanon > Beirut Governorate > Beirut (0.04)