AITopics | question difficulty

Think Smart, Not Hard: Difficulty Adaptive Reasoning for Large Audio Language Models

Sheng, Zhichao, Zhou, Shilin, Gong, Chen, Li, Zhenghua

arXiv.org Artificial Intelligence, Nov-20-2025

Large Audio Language Models (LALMs), powered by the chain-of-thought (CoT) paradigm, have shown remarkable reasoning capabilities. Intuitively, different problems often require varying depths of reasoning. While some methods can determine whether to reason for a given problem, they typically lack a fine-grained mechanism to modulate how much to reason. This often results in a "one-size-fits-all" reasoning depth, which generates redundant overthinking for simple questions while failing to allocate sufficient thought to complex ones. In this paper, we conduct an in-depth analysis of LALMs and find that an effective and efficient LALM should reason smartly by adapting its reasoning depth to the problem's complexity. To achieve this, we propose a difficulty-adaptive reasoning method for LALMs. Specifically, we propose a reward function that dynamically links reasoning length to the model's perceived problem difficulty. This reward encourages shorter, concise reasoning for easy tasks and more elaborate, in-depth reasoning for complex ones. Extensive experiments demonstrate that our method is both effective and efficient, simultaneously improving task performance and significantly reducing the average reasoning length. Further analysis of reasoning structure paradigms offers valuable insights for future work.
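
The abstract does not give the reward formula, so the following is only an illustrative sketch of how a reward might link reasoning length to perceived difficulty. Every name and constant here (`perceived_difficulty`, `length_reward`, the token budgets, the penalty weight) is hypothetical, and using the failure rate over sampled rollouts as the difficulty signal is an assumption, not the authors' definition.

```python
def perceived_difficulty(rollout_correct: list[bool]) -> float:
    """Assumed proxy: fraction of sampled rollouts that were wrong (0 = easy, 1 = hard)."""
    return 1.0 - sum(rollout_correct) / len(rollout_correct)


def length_reward(correct: bool, n_tokens: int, difficulty: float,
                  min_budget: int = 64, max_budget: int = 1024,
                  penalty: float = 0.5) -> float:
    """Correctness reward minus a penalty for straying from a difficulty-scaled length target."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    # Easy problems get a short target length, hard problems a long one.
    target = min_budget + difficulty * (max_budget - min_budget)
    # Normalised distance from the target, capped at 1.
    deviation = min(abs(n_tokens - target) / max_budget, 1.0)
    return (1.0 if correct else 0.0) - penalty * deviation


# A correct, short answer to an easy question scores higher than a long one;
# for a hard question the preference is reversed.
easy_short = length_reward(True, 64, difficulty=0.0)
easy_long = length_reward(True, 1024, difficulty=0.0)
hard_short = length_reward(True, 64, difficulty=1.0)
hard_long = length_reward(True, 1024, difficulty=1.0)
```

Under these assumptions the same penalty term discourages overthinking on easy items and underthinking on hard ones, which is the qualitative behaviour the abstract describes.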

large language model, machine learning, natural language

arXiv.org Artificial Intelligence

2509.2196

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)


LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

Carmel, David, Filice, Simone, Horowitz, Guy, Maarek, Yoelle, Shtoff, Alex, Somekh, Oren, Tavory, Ran

arXiv.org Artificial Intelligence, Nov-19-2025

With Retrieval Augmented Generation (RAG) becoming increasingly prominent in generative AI solutions, there is an emerging need to evaluate the effectiveness of such systems systematically. We introduce the LiveRAG benchmark, a publicly available dataset of 895 synthetic questions and answers designed to support systematic evaluation of RAG-based Q&A systems. This synthetic benchmark is derived from the one used during the SIGIR'2025 LiveRAG Challenge, where competitors were evaluated under strict time constraints. It is augmented with information that was not made available to competitors during the Challenge, such as the ground-truth answers, together with their associated supporting claims, which were used for evaluating competitors' answers. In addition, each question is associated with estimated difficulty and discriminability scores, derived from applying an Item Response Theory model to competitors' responses. Our analysis highlights the diversity of the benchmark's questions, the wide range of their difficulty levels, and their usefulness in differentiating between system capabilities. The LiveRAG benchmark will hopefully help the community advance RAG research, conduct systematic evaluation, and develop more robust Q&A systems.
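
To make the difficulty and discriminability scores concrete, here is a minimal sketch of the standard two-parameter logistic (2PL) Item Response Theory model. The abstract does not say which IRT variant was fitted, so the 2PL form and the function name `p_correct` are assumptions for illustration only.

```python
import math


def p_correct(ability: float, difficulty: float, discrimination: float) -> float:
    """2PL IRT: probability that a system with the given ability answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Two hypothetical systems, one weaker (ability -1) and one stronger (ability +1),
# answering a question of medium difficulty (difficulty 0).
weak, strong = -1.0, 1.0

# A highly discriminative question separates the two systems more sharply
# than a weakly discriminative one.
gap_high = p_correct(strong, 0.0, 2.0) - p_correct(weak, 0.0, 2.0)
gap_low = p_correct(strong, 0.0, 0.5) - p_correct(weak, 0.0, 0.5)
```

In this model, difficulty is the ability level at which a system has a 50% chance of answering correctly, and discrimination controls how steeply that chance rises around that point.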

large language model, machine learning, question answering

arXiv.org Artificial Intelligence

2511.14531

Country:

North America > United States (0.28)
Asia > Middle East (0.28)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Sun, Shuang, Song, Huatong, Wang, Yuhao, Ren, Ruiyang, Jiang, Jinhao, Zhang, Junjie, Bai, Fei, Deng, Jia, Zhao, Wayne Xin, Liu, Zheng, Fang, Lei, Wang, Zhongyuan, Wen, Ji-Rong

arXiv.org Artificial Intelligence, Oct-9-2025

Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, suffer from distributional mismatches in simulated environments, or incur prohibitive computational costs in real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that supervised fine-tuning (SFT) on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarcity bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

arxiv preprint arxiv, large language model, machine learning

arXiv.org Artificial Intelligence

2505.16834

Country:

Asia (0.28)
Europe > Austria (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


Uncertainty-Aware Knowledge Tracing Models

Mitton, Joshua, Bhattacharyya, Prarthana, Abboud, Ralph, Woodhead, Simon

arXiv.org Artificial Intelligence, Sep-29-2025

Research on Knowledge Tracing (KT) models focuses mainly on model developments aimed at improving predictive accuracy. Most of these models make the most incorrect predictions when students choose a distractor, so student errors go undetected. We present an approach that adds new capabilities to KT models by capturing predictive uncertainty, and we demonstrate that larger predictive uncertainty aligns with incorrect model predictions. We show that uncertainty in KT models is informative and that this signal would be pedagogically useful in an educational learning platform deployed in limited-resource settings where understanding student ability is necessary.
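
The abstract does not state how predictive uncertainty is captured. One common choice, shown here purely as an assumed illustration, is the Shannon entropy of the model's predicted distribution over answer options; the function name and the example distributions are hypothetical.

```python
import math


def predictive_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a predicted distribution over answer options."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical predictions for a four-option multiple-choice question.
confident = [0.94, 0.02, 0.02, 0.02]   # model is nearly sure of option A
unsure = [0.40, 0.35, 0.15, 0.10]      # probability mass spread over distractors

# Higher entropy flags the prediction a teacher or platform may want to double-check.
flag_for_review = predictive_entropy(unsure) > predictive_entropy(confident)
```

A platform could route high-entropy predictions to a human or to a follow-up question, which is the kind of pedagogical use the abstract points to.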

artificial intelligence, machine learning, student

arXiv.org Artificial Intelligence

2509.21514

Country: North America > United States > New York > New York County > New York City (0.14)

Genre: Research Report (0.82)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)


The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations

Zhu, Yubo, Liu, Dongrui, Lin, Zecheng, Tong, Wei, Zhong, Sheng, Shao, Jing

arXiv.org Artificial Intelligence, Sep-17-2025

Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
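
As a rough illustration of difficulty-guided adaptive inference, the sketch below scales a Self-Consistency sample budget with a difficulty estimate and then takes a majority vote. It does not reproduce the paper's hidden-state value function: the difficulty score is assumed to be given, and `sample_budget`, `majority_vote`, and the budget limits are hypothetical.

```python
from collections import Counter


def sample_budget(difficulty: float, min_samples: int = 1, max_samples: int = 16) -> int:
    """Map a difficulty estimate in [0, 1] to a number of responses to sample."""
    difficulty = min(max(difficulty, 0.0), 1.0)  # clamp out-of-range estimates
    return min_samples + round(difficulty * (max_samples - min_samples))


def majority_vote(answers: list[str]) -> str:
    """Self-Consistency aggregation: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


# Easy questions get a single sample; hard ones get the full budget.
easy_n = sample_budget(0.0)
hard_n = sample_budget(1.0)

# Hypothetical final answers extracted from sampled responses.
final = majority_vote(["42", "41", "42"])
```

Spending fewer samples on questions judged easy is one simple way a difficulty estimate can reduce the number of generated tokens.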

arxiv preprint arxiv, large language model, machine learning

arXiv.org Artificial Intelligence

2509.12886

Country: Asia > China (0.28)

Genre: Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.35)


Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory

Liu, Yexiang, Li, Zekun, Fang, Zhi, Xu, Nan, He, Ran, Tan, Tieniu

arXiv.org Artificial Intelligence, Aug-18-2025

Recently, scaling test-time compute for Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experimental results consistently show that as the number of samples and the computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sample counts, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two methods derived from our theoretical analysis that significantly improve scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
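
A toy calculation can show why a strategy with better single-sample accuracy may fall behind under majority voting. The sketch below assumes a simplified binary setting (each sample is independently right or wrong with a fixed per-question probability), which is not the paper's full analysis; the function names and the per-question probabilities are made up for illustration.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """Probability that a majority of n independent samples is correct (binary answers, odd n)."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))


def mean_accuracy(per_question_p: list[float], n: int) -> float:
    """Average majority-vote accuracy over a set of questions."""
    return sum(majority_correct(p, n) for p in per_question_p) / len(per_question_p)


# Hypothetical strategies evaluated on two questions.
# "complex" is very strong on one question but below 50% on the other;
# "simple" is moderately above 50% on both.
complex_strategy = [0.90, 0.45]
simple_strategy = [0.60, 0.60]

single_sample = (mean_accuracy(complex_strategy, 1), mean_accuracy(simple_strategy, 1))
many_samples = (mean_accuracy(complex_strategy, 51), mean_accuracy(simple_strategy, 51))
```

In this toy setup the complex strategy leads with one sample, but voting amplifies per-question probabilities above 0.5 towards 1 and those below 0.5 towards 0, so the simple strategy overtakes it at 51 samples.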

large language model, machine learning, natural language

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2025.acl-long.1356

2505.10981

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)


Sense and Sensibility: What makes a social robot convincing to high-school students?

Gonzalez-Oliveras, Pablo, Engwall, Olov, Majlesi, Ali Reza

arXiv.org Artificial Intelligence, Jun-17-2025

This study with 40 high-school students demonstrates the strong influence of a social educational robot on students' decision-making for a set of eight true-false questions on electric circuits, for which the theory had been covered in the students' courses. The robot argued for the correct answer on six questions and the wrong answer on two, and 75% of the students were persuaded by the robot to perform beyond their expected capacity, positively when the robot was correct and negatively when it was wrong. Students with more experience of using large language models were even more likely to be influenced by the robot's stance, in particular for the two easiest questions on which the robot was wrong, suggesting that familiarity with AI can increase susceptibility to misinformation by AI. We further examined how three different levels of portrayed robot certainty, displayed using semantics, prosody and facial signals, affected how the students aligned with the robot's answer on specific questions and how convincing they perceived the robot to be on these questions. The students aligned with the robot's answers in 94.4% of the cases when the robot was portrayed as Certain, 82.6% when it was Neutral and 71.4% when it was Uncertain. The alignment was thus high for all conditions, highlighting students' general susceptibility to accept the robot's stance, but alignment in the Uncertain condition was significantly lower than in the Certain. Post-test questionnaire answers further show that students found the robot most convincing when it was portrayed as Certain. These findings highlight the need for educational robots to adjust their display of certainty based on the reliability of the information they convey, to promote students' critical thinking and reduce undue influence.
Educational robots are becoming more common and they have significant potential in, e.g., STEM (science, technology, engineering and mathematics) education [46, 69, 17], offering students realistic and natural interactions, not least by employing Large Language Models (LLMs), as demonstrated in several recent studies [41, 68, 67]. However, it is also well known that while the LLMs' linguistic proficiency is often astonishing, their factual "knowledge" in STEM subjects is flawed, and incorrect statements occur frequently [34, 60]. Since robots can exert high informational social influence [38, 24, 25, 55, 56] and students will align with the robot's views to a large extent [27], the positive as well as negative effects of learning with a social robot need to be considered: Students need to use critical thinking to decide if they should accept the robot's propositions [63]. Educators need to understand which students are more at risk of being misled by a robot presenting incorrect STEM facts, to provide in-time support.

large language model, machine learning, natural language

arXiv.org Artificial Intelligence

2506.12507

Country:

Europe (0.68)
North America > United States > California (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.93)

Industry: Education > Educational Setting > K-12 Education > Secondary School (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Robots > Robots in the Home (0.90)
Information Technology > Artificial Intelligence > Robots > Humanoid Robots (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)


Advancing Question Generation with Joint Narrative and Difficulty Control

Leite, Bernardo, Cardoso, Henrique Lopes

arXiv.org Artificial Intelligence, Jun-10-2025

Question Generation (QG), the task of automatically generating questions from a source input, has seen significant progress in recent years. Difficulty-controllable QG (DCQG) enables control over the difficulty level of generated questions while considering the learner's ability. Additionally, narrative-controllable QG (NCQG) allows control over the narrative aspects embedded in the questions. However, research in QG lacks a focus on combining these two types of control, which is important for generating questions tailored to educational purposes. To address this gap, we propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances. Our findings highlight the conditions under which the strategy performs well and discuss the trade-offs associated with its application.

difficulty level, natural language, question answering

arXiv.org Artificial Intelligence

2506.06812

Country:

Europe (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry: Education > Assessment & Standards > Student Performance (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)


AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Ding, Xuanwen, Pan, Chengjun, Li, Zejun, Zhang, Jiwen, Wang, Siyuan, Wei, Zhongyu

arXiv.org Artificial Intelligence, May-28-2025

Evaluating multimodal large language models (MLLMs) is increasingly expensive, as the growing size and cross-modality complexity of benchmarks demand significant scoring efforts. To address this escalating cost, we introduce AutoJudger, an agent-driven framework for efficient and adaptive benchmarking of MLLMs. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions based on the model's real-time performance. Specifically, AutoJudger incorporates two pivotal components: a semantic-aware retrieval mechanism to ensure that selected questions cover diverse and challenging scenarios across both vision and language modalities, and a dynamic memory that maintains contextual statistics of previously evaluated questions to guide coherent and globally informed question selection throughout the evaluation process. Extensive experiments on four representative multimodal benchmarks demonstrate that our adaptive framework dramatically reduces evaluation expenses: on MMT-Bench, AutoJudger uses only 4% of the data to achieve over 90% ranking accuracy relative to the full benchmark evaluation.
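
For readers unfamiliar with IRT-based adaptive testing, the sketch below shows the classical approach of picking the next question by maximum Fisher information under a 2PL model. This is a textbook baseline, not AutoJudger's agent-driven selection; the item parameters and function names are hypothetical.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability of a correct answer given ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta: float, a: float, b: float) -> float:
    """Information an item provides about ability theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next(theta: float, items: list[tuple[float, float]], asked: set[int]) -> int:
    """Return the index of the unasked item that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Hypothetical item bank as (discrimination, difficulty) pairs.
items = [(1.0, -2.0), (1.0, 0.1), (1.0, 3.0)]

first = pick_next(0.0, items, asked=set())   # item whose difficulty is closest to the ability estimate
second = pick_next(0.0, items, asked={first})
```

Items whose difficulty sits near the current ability estimate are the most informative, which is why an adaptive evaluator can rank models using only a small fraction of a benchmark.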

arxiv preprint arxiv, large language model, machine learning

arXiv.org Artificial Intelligence

2505.21389

Country: North America > United States > California (0.46)

Genre: Research Report (1.00)

Industry:

Information Technology (0.46)
Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Analyzing Feedback Mechanisms in AI-Generated MCQs: Insights into Readability, Lexical Properties, and Levels of Challenge

Yaacoub, Antoun, Assaghir, Zainab, Prevost, Lionel, Da-Rugna, Jérôme

arXiv.org Artificial Intelligence, May-1-2025

Artificial Intelligence (AI)-generated feedback in educational settings has garnered considerable attention due to its potential to enhance learning outcomes. However, a comprehensive understanding of the linguistic characteristics of AI-generated feedback, including readability, lexical richness, and adaptability across varying challenge levels, remains limited. This study delves into the linguistic and structural attributes of feedback generated by Google's Gemini 1.5-flash text model for computer science multiple-choice questions (MCQs). A dataset of over 1,200 MCQs was analyzed, considering three difficulty levels (easy, medium, hard) and three feedback tones (supportive, neutral, challenging). Key linguistic metrics, such as length, readability scores (Flesch-Kincaid Grade Level), vocabulary richness, and lexical density, were computed and examined. A fine-tuned RoBERTa-based multi-task learning (MTL) model was trained to predict these linguistic properties, achieving a Mean Absolute Error (MAE) of 2.0 for readability and 0.03 for vocabulary richness. The findings reveal significant interaction effects between feedback tone and question difficulty, demonstrating the dynamic adaptation of AI-generated feedback within diverse educational contexts. These insights contribute to the development of more personalized and effective AI-driven feedback mechanisms, highlighting the potential for improved learning outcomes while underscoring the importance of ethical considerations in their design and deployment.
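
Two of the metrics named in the abstract have standard definitions that are easy to state in code. The sketch below uses the published Flesch-Kincaid Grade Level formula and the usual Mean Absolute Error; it assumes word, sentence, and syllable counts are already available (syllable counting itself needs a tokenizer or dictionary), and the example numbers are made up.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def mean_absolute_error(y_true: list[float], y_pred: list[float]) -> float:
    """Average absolute difference between true and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical feedback text: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)

# Hypothetical true vs. predicted readability grades for three feedback messages.
mae = mean_absolute_error([8.0, 10.0, 12.0], [10.0, 10.0, 10.0])
```

Read this way, an MAE of 2.0 on readability means the model's predicted grade level is off by about two school grades on average.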

difficulty level, machine learning, natural language

arXiv.org Artificial Intelligence

2504.21013

Country: Europe > France (0.15)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
