Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
Strohmaier, Anselm R., Van Dooren, Wim, Seßler, Kathrin, Greer, Brian, Verschaffel, Lieven
arXiv.org Artificial Intelligence
Preprint, August 2025. This version has not been peer-reviewed.

Abstract

The progress of Large Language Models (LLMs) such as ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well suited for solving mathematical word problems. Yet their actual competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, comprising three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast how word problems and their solution processes are conceptualized for LLMs and for students. In computer-science research, this is typically labeled mathematical reasoning, a term that does not align with its usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require consideration of the realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, o3, and GPT-5 on 287 word problems shows that the most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems whose real-world context is problematic or nonsensical. In sum, based on all three aspects, we argue that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.
Keywords: LLM; word-problem solving; AI; mathematical reasoning; modelling

1 Introduction

In the last couple of years, the rapid improvement of Large Language Models (LLMs) has led to an unprecedented interest of educational research in artificial intelligence in general, and in LLMs in particular (Kasneci et al., 2023). However, while LLMs excel at producing, translating, and reviewing text, they are not natively designed for processing numerical information, calculating, or proving (Chang et al., 2024). Compared to other tasks, solving mathematical problems is relatively difficult for LLMs (Testolin, 2024). This is also true for mathematical word-problem solving.
Aug-12-2025