mathematics education
An Ontology-Based Approach to Optimizing Geometry Problem Sets for Skill Development
Bouzinier, Michael, Trifonov, Sergey, Chen, Matthew, Venkatesh, Tarun, Rifkin, Lielle
Euclidean geometry has historically played a central role in cultivating logical reasoning and abstract thinking within mathematics education, but has experienced waning emphasis in recent curricula. The resurgence of interest, driven by advances in artificial intelligence and educational technology, has highlighted geometry's potential to develop essential cognitive skills and inspired new approaches to automated problem solving and proof verification. This article presents an ontology-based framework for annotating and optimizing geometry problem sets, originally developed in the 1990s. The ontology systematically classifies geometric problems, solutions, and associated skills into interlinked facts, objects, and methods, supporting granular tracking of student abilities and facilitating curriculum design. The core concept of 'solution graphs'--directed acyclic graphs encoding multiple solution pathways and skill dependencies--enables alignment of problem selection with instructional objectives. We hypothesize that this framework also points toward automated solution validation via semantic parsing. We contend that our approach addresses longstanding challenges in representing dynamic, procedurally complex mathematical knowledge, paving the way for adaptive, feedback-rich educational tools. Our methodology offers a scalable, adaptable foundation for future advances in intelligent geometry education and automated reasoning.
Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective
Strohmaier, Anselm R., Van Dooren, Wim, Seßler, Kathrin, Greer, Brian, Verschaffel, Lieven
Preprint August 2025 - This version has not been peer - reviewed . Abstract The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word - problem solving. Since LLMs can handle textual input with ease, they appear well - suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real - world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics - education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state - of - the - art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer - science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word - problem corpora are dominated by s - problems, which do not require a consideration of realities of their real - world context. Finally, our evaluation of GPT - 3.5 - turbo, GPT - 4o - mini, GPT - 4.1, o3, and GPT - 5 on 287 word problems shows that most recent LLMs solve these s - problems with near - perfect accuracy, including a perfect score on 2 0 problems from PISA. LLMs still showed weaknesses in tackling problems where the real - world context is problematic or non - sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classroom s. Keywords LLM; word - problem solving; AI; mathematical reasoning; modelling 1 Introduction In the last couple of years, the rapid improvement of Large Language Models (LLMs) has led to an unprecedented interest in educational research in artificial intelligence in general, and of LLMs in particular (Kasneci et al., 2023) . However, while LLMs excel at producing, translating and reviewing text, they are not natively designed for processing numerical information, calculating, or proving (Chang et al., 2024) . C ompared to other tasks, solving mathematical problems is relatively difficult for LLMs (Testolin, 2024) . This is also true for mathematical word - problems solving.
Evaluation of LLMs for mathematical problem solving
Wang, Ruonan, Wang, Runxi, Shen, Yunwen, Wu, Chengfeng, Zhou, Qinglin, Chandra, Rohitash
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but are still understudied for their potential to solve mathematical problems. In this study, we compare three prominent LLMs, including GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexities (GSM8K, MATH500, and MIT Open Courseware datasets). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework to assess final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent in performance across all the datasets, but particularly it performs outstandingly in high-level questions of the MIT Open Courseware dataset. DeepSeek-V3 is competitively strong in well-structured domains such as optimisation, but suffers from fluctuations in accuracy in statistical inference tasks. Gemini-2.0 shows strong linguistic understanding and clarity in well-structured problems but performs poorly in multi-step reasoning and symbolic logic. Our error analysis reveals particular deficits in each model: GPT-4o is at times lacking in sufficient explanation or precision; DeepSeek-V3 leaves out intermediate steps; and Gemini-2.0 is less flexible in mathematical reasoning in higher dimensions.
The Use of Generative Artificial Intelligence for Upper Secondary Mathematics Education Through the Lens of Technology Acceptance
Setälä, Mika, Heilala, Ville, Sikström, Pieta, Kärkkäinen, Tommi
This study investigated the students' perceptions of using Generative Artificial Intelligence (GenAI) in upper-secondary mathematics education. Data was collected from Finnish high school students to represent how key constructs of the Technology Acceptance Model (Perceived Usefulness, Perceived Ease of Use, Perceived Enjoyment, and Intention to Use) influence the adoption of AI tools. First, a structural equation model for a comparative study with a prior study was constructed and analyzed. Then, an extended model with the additional construct of Compatibility, which represents the alignment of AI tools with students' educational experiences and needs, was proposed and analyzed. The results demonstrated a strong influence of perceived usefulness on the intention to use GenAI, emphasizing the statistically significant role of perceived enjoyment in determining perceived usefulness and ease of use. The inclusion of compatibility improved the model's explanatory power, particularly in predicting perceived usefulness. This study contributes to a deeper understanding of how AI tools can be integrated into mathematics education and highlights key differences between the Finnish educational context and previous studies based on structural equation modeling.
Solving with GeoGebra Discovery an Austrian Mathematics Olympiad problem: Lessons Learned
Ariño-Morera, Belén, Kovács, Zoltán, Recio, Tomás, Tolmos, Piedad
We address, through the automated reasoning tools in GeoGebra Discovery, a problem from a regional phase of the Austrian Mathematics Olympiad 2023. Trying to solve this problem gives rise to four different kind of feedback: the almost instantaneous, automated solution of the proposed problem; the measure of its complexity, according to some recent proposals; the automated discovery of a generalization of the given assertion, showing that the same statement is true over more general polygons than those mentioned in the problem; and the difficulties associated to the analysis of the surprising and involved high number of degenerate cases that appear when using the LocusEquation command in this problem. In our communication we will describe and reflect on these diverse issues, enhancing its exemplar role for showing some of the advantages, problems, and current fields of development of GeoGebra Discovery.
Showing Proofs, Assessing Difficulty with GeoGebra Discovery
Kovács, Zoltán, Recio, Tomás, Vélez, M. Pilar
See [1] for a general description and references. The goal of the current contribution is to present some ongoing work regarding two different, but related, important improvements of GeoGebra Discovery. One, to visualize the different steps that GG Discovery performs with a given geometric statement until it declares its truth (or failure). Two, to test, through different elementary examples, the suitability of an original proposal to evaluate the interest, complexity or difficulty of a given statement. Let us advance that our proposal involves the notion of syzygy of a set of polynomials. The relevance of showing details about each of the steps performed by our automated reasoning algorithms implemented in GG Discovery is quite evident. In fact, as a consequence of the result in [2], describing the formalization of the arithmetization of Euclidean plane geometry, proofs of geometric statements obtained using algebraic geometry algorithms are also valid on the synthetic geometry realm.
Computer Assisted Proofs and Automated Methods in Mathematics Education
This survey paper is an expanded version of an invited keynote at the ThEdu'22 workshop, August 2022, in Haifa (Israel). After a short introduction on the developments of CAS, DGS and other useful technologies, we show implications in Mathematics Education, and in the broader frame of STEAM Education. In particular, we discuss the transformation of Mathematics Education into exploration-discovery-conjecture-proof scheme, avoiding usage as a black box . This scheme fits well into the so-called 4 C's of 21st Century Education. Communication and Collaboration are emphasized not only between humans, but also between machines, and between man and machine. Specific characteristics of the outputs enhance the need of Critical Thinking. The usage of automated commands for exploration and discovery is discussed, with mention of limitations where they exist. We illustrate the topic with examples from parametric integrals (describing a "cognitive neighborhood" of a mathematical notion), plane geometry, and the study of plane curves (envelopes, isoptic curves). Some of the examples are fully worked out, others are explained and references are given.
A Classification of Artificial Intelligence Systems for Mathematics Education
Van Vaerenbergh, Steven, Pérez-Suay, Adrián
This chapter provides an overview of the different Artificial Intelligence (AI) systems that are being used in contemporary digital tools for Mathematics Education (ME). It is aimed at researchers in AI and Machine Learning (ML), for whom we shed some light on the specific technologies that are being used in educational applications; and at researchers in ME, for whom we clarify: i) what the possibilities of the current AI technologies are, ii) what is still out of reach and iii) what is to be expected in the near future. We start our analysis by establishing a high-level taxonomy of AI tools that are found as components in digital ME applications. Then, we describe in detail how these AI tools, and in particular ML, are being used in two key applications, specifically AI-based calculators and intelligent tutoring systems. We finish the chapter with a discussion about student modeling systems and their relationship to artificial general intelligence.