solution step
Thinking Out Loud: Do Reasoning Models Know When They're Right?
Zeng, Qingcheng, Xuan, Weihao, Cui, Leyang, Voigt, Rob
Large reasoning models (LRMs) have recently demonstrated impressive capabilities in complex reasoning tasks by leveraging increased test-time computation and exhibiting behaviors reminiscent of human-like self-reflection. While LRMs show a clear capacity for valuable self-reflection, how this ability interacts with other model behaviors remains underexplored. We investigate this connection by analyzing verbalized confidence, how models articulate their certainty, as a lens into the nature of self-reflection in LRMs. We find that supervised fine-tuning on reasoning traces (i.e., distillation) and reinforcement learning can improve verbalized calibration in reasoning-intensive settings in a progressive, laddered fashion. However, our results also indicate that reasoning models may possess a diminished awareness of their own knowledge boundaries, as evidenced by significantly lower "I don't know" response rates on factuality benchmarks. Moreover, we examine the relationship between verbalized confidence and reasoning chains, finding that models tend to express higher confidence when providing shorter or less elaborate reasoning. Our findings highlight how reasoning-oriented training can enhance performance in reasoning-centric tasks while potentially incurring a "reasoning tax," a cost reflected in the model's reduced ability to accurately recognize the limits of its own knowledge in small-scale models. More broadly, our work showcases how this erosion of knowledge boundaries can compromise model faithfulness, as models grow more confident without a commensurate understanding of when they should abstain.
- North America > United States > California > Yolo County > Davis (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (2 more...)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing
Ozyurt, Yilmazcan, Almaci, Tunaberk, Feuerriegel, Stefan, Sachan, Mrinmaya
We introduce ExRec, a general framework for personalized exercise recommendation with semantically-grounded knowledge tracing. Our method builds on the observation that existing exercise recommendation approaches simulate student performance via knowledge tracing (KT) but they often overlook two key aspects: (a) the semantic content of questions and (b) the sequential, structured progression of student learning. To address this, our ExRec presents an end-to-end pipeline, from annotating the KCs of questions and learning their semantic representations to training KT models and optimizing several reinforcement learning (RL) methods. Moreover, we improve standard Q-learning-based continuous RL methods via a tailored model-based value estimation (MVE) approach that directly leverages the components of KT model in estimating cumulative knowledge improvement. We validate the effectiveness of our ExRec using various RL methods across four real-world tasks with different educational goals in online math learning. We further show that ExRec generalizes robustly to new, unseen questions and that it produces interpretable student learning trajectories. Together, our findings highlight the promise of KT-guided RL for effective personalization in education.
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > Middle East > Jordan (0.04)
- Workflow (0.94)
- Research Report > New Finding (0.66)
- Instructional Material > Course Syllabus & Notes (0.46)
- Education > Educational Setting > Online (1.00)
- Education > Educational Technology > Educational Software > Computer Based Training (0.93)
- Education > Curriculum > Subject-Specific Education (0.87)
- Education > Assessment & Standards (0.66)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Qin, Tian, Park, Core Francisco, Kwun, Mujin, Walsman, Aaron, Malach, Eran, Anand, Nikhil, Tanaka, Hidenori, Alvarez-Melis, David
Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill-improving execution robustness on problems the model already knows how to solve-a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
- Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
- North America > United States (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (3 more...)
Beyond Words: How Large Language Models Perform in Quantitative Management Problem-Solving
This study examines how Large Language Models (LLMs) perform when tackling quantitative management decision problems in a zero-shot setting. Drawing on 900 responses generated by five leading models across 20 diverse managerial scenarios, our analysis explores whether these base models can deliver accurate numerical decisions under varying presentation formats, scenario complexities, and repeated attempts. Contrary to prior findings, we observed no significant effects of text presentation format (direct, narrative, or tabular) or text length on accuracy. However, scenario complexity -- particularly in terms of constraints and irrelevant parameters -- strongly influenced performance, often degrading accuracy. Surprisingly, the models handled tasks requiring multiple solution steps more effectively than expected. Notably, only 28.8\% of responses were exactly correct, highlighting limitations in precision. We further found no significant ``learning effect'' across iterations: performance remained stable across repeated queries. Nonetheless, significant variations emerged among the five tested LLMs, with some showing superior binary accuracy. Overall, these findings underscore both the promise and the pitfalls of harnessing LLMs for complex quantitative decision-making, informing managers and researchers about optimal deployment strategies.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Do Large Language Models Truly Grasp Mathematics? An Empirical Exploration From Cognitive Psychology
Xie, Wei, Ma, Shuoyoucheng, Wang, Zhenhua, Wang, Enze, Chen, Kai, Sun, Xiaobing, Wang, Baosheng
The cognitive mechanism by which Large Language Models (LLMs) solve mathematical problems remains a widely debated and unresolved issue. Currently, there is little interpretable experimental evidence that connects LLMs' problem-solving with human cognitive psychology.To determine if LLMs possess human-like mathematical reasoning, we modified the problems used in the human Cognitive Reflection Test (CRT). Our results show that, even with the use of Chains of Thought (CoT) prompts, mainstream LLMs, including the latest o1 model (noted for its reasoning capabilities), have a high error rate when solving these modified CRT problems. Specifically, the average accuracy rate dropped by up to 50% compared to the original questions.Further analysis of LLMs' incorrect answers suggests that they primarily rely on pattern matching from their training data, which aligns more with human intuition (System 1 thinking) rather than with human-like reasoning (System 2 thinking). This finding challenges the belief that LLMs have genuine mathematical reasoning abilities comparable to humans. As a result, this work may adjust overly optimistic views on LLMs' progress towards artificial general intelligence.
Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing
Ozyurt, Yilmazcan, Feuerriegel, Stefan, Sachan, Mrinmaya
Knowledge tracing (KT) is a popular approach for modeling students' learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is timeconsuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT models on two large real-world Math learning datasets, where we achieve consistent performance improvements. The recent years have witnessed a surge in online learning platforms (Adedoyin & Soykan, 2023; Gros & García-Peñalvo, 2023), where students learn new knowledge concepts, which are then tested through exercises. Needless to say, personalization is crucial for effective learning: it allows that new knowledge concepts are carefully tailored to the current knowledge state of the student, which is more effective than one-size-fits-all approaches to learning (Cui & Sachan, 2023; Xu et al., 2024). However, such personalization requires that the knowledge of students is continuously assessed, which highlights the need for knowledge tracing (KT). In KT, one models the temporal dynamics of students' learning processes (Corbett & Anderson, 1994) in terms of a core set of skills, which are called knowledge concepts (KCs). KT models are typically time-series models that receive the past interactions of the learner as input (e.g., her previous exercises) in order to predict response of the learner to the next exercise. Yet, existing KT models have two main limitations that hinder their applicability in practice (see Figure 1). They require a comprehensive mapping between KCs and questions, which is typically done through manual annotations by experts. However, such KC annotation is both time-intensive and prone to errors (Clark, 2014; Bier et al., 2019). KT models overlook the semantics of both questions and KCs.
- Europe > Switzerland > Zürich > Zürich (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Instructional Material (0.93)
- Workflow (0.67)
- Research Report (0.64)
- Education > Curriculum > Subject-Specific Education (0.48)
- Education > Educational Setting > Online (0.48)
- Education > Educational Technology > Educational Software (0.46)
AutoTutor meets Large Language Models: A Language Model Tutor with Rich Pedagogy and Guardrails
Chowdhury, Sankalan Pal, Zouhar, Vilém, Sachan, Mrinmaya
Large Language Models (LLMs) have found several use cases in education, ranging from automatic question generation to essay evaluation. In this paper, we explore the potential of using Large Language Models (LLMs) to author Intelligent Tutoring Systems. A common pitfall of LLMs is their straying from desired pedagogical strategies such as leaking the answer to the student, and in general, providing no guarantees. We posit that while LLMs with certain guardrails can take the place of subject experts, the overall pedagogical design still needs to be handcrafted for the best learning results. Based on this principle, we create a sample end-to-end tutoring system named MWPTutor, which uses LLMs to fill in the state space of a pre-defined finite state transducer. This approach retains the structure and the pedagogy of traditional tutoring systems that has been developed over the years by learning scientists but brings in additional flexibility of LLM-based approaches. Through a human evaluation study on two datasets based on math word problems, we show that our hybrid approach achieves a better overall tutoring score than an instructed, but otherwise free-form, GPT-4. MWPTutor is completely modular and opens up the scope for the community to improve its performance by improving individual modules or using different teaching strategies that it can follow.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (9 more...)
- Workflow (1.00)
- Research Report > New Finding (0.46)
Assertion Enhanced Few-Shot Learning: Instructive Technique for Large Language Models to Generate Educational Explanations
Shahriar, Tasmia, Ramos, Kelly, Matsuda, Noboru
Human educators possess an intrinsic ability to anticipate and seek educational explanations from students, which drives them to pose thought-provoking questions when students cannot articulate these explanations independently. We aim to imbue Intelligent Tutoring Systems with this ability using few-shot learning capability of Large Language Models. Our work proposes a novel prompting technique, Assertion Enhanced Few-Shot Learning, to facilitate the generation of accurate, detailed oriented educational explanations. Our central hypothesis is that, in educational domain, few-shot demonstrations are necessary but not a sufficient condition for quality explanation generation. We conducted a study involving 12 in-service teachers, comparing our approach to Traditional Few-Shot Learning. The results show that Assertion Enhanced Few-Shot Learning improves explanation accuracy by 15% and yields higher-quality explanations, as evaluated by teachers. We also conduct a qualitative ablation study to factor the impact of assertions to provide educator-friendly prompting guidelines for generating explanations in their domain of interest.
- North America > United States > North Carolina > Wake County > Raleigh (0.04)
- Europe > Switzerland (0.04)
Interpretable Math Word Problem Solution Generation Via Step-by-step Planning
Zhang, Mengxue, Wang, Zichao, Yang, Zhichao, Feng, Weiqi, Lan, Andrew
Solutions to math word problems (MWPs) with step-by-step explanations are valuable, especially in education, to help students better comprehend problem-solving strategies. Most existing approaches only focus on obtaining the final correct answer. A few recent approaches leverage intermediate solution steps to improve final answer correctness but often cannot generate coherent steps with a clear solution strategy. Contrary to existing work, we focus on improving the correctness and coherence of the intermediate solutions steps. We propose a step-by-step planning approach for intermediate solution generation, which strategically plans the generation of the next solution step based on the MWP and the previous solution steps. Our approach first plans the next step by predicting the necessary math operation needed to proceed, given history steps, then generates the next step, token-by-token, by prompting a language model with the predicted math operation. Experiments on the GSM8K dataset demonstrate that our approach improves the accuracy and interpretability of the solution on both automatic metrics and human evaluation.
- North America > Dominican Republic (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- (3 more...)
- Workflow (0.68)
- Research Report (0.64)
Learning to Reason With Relational Abstractions
Nam, Andrew J., Ren, Mengye, Finn, Chelsea, McClelland, James L.
Large language models have recently shown promising progress in mathematical reasoning when fine-tuned with human-generated sequences walking through a sequence of solution steps. However, the solution sequences are not formally structured and the resulting model-generated sequences may not reflect the kind of systematic reasoning we might expect an expert human to produce. In this paper, we study how to build stronger reasoning capability in language models using the idea of relational abstractions. We introduce new types of sequences that more explicitly provide an abstract characterization of the transitions through intermediate solution steps to the goal state. We find that models that are supplied with such sequences as prompts can solve tasks with a significantly higher accuracy, and models that are trained to produce such sequences solve problems better than those that are trained with previously used human-generated sequences and other baselines. Our work thus takes several steps toward elucidating and improving how language models perform on tasks requiring multi-step mathematical reasoning.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Maryland > Baltimore (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)