AITopics | math question

Collaborating Authors

math question

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Multi-Agent Collaborative Framework For Math Problem Generation

Karbasi, Kia, Hong, Kevin, Samadi, Mohammad Amin, Pottie, Gregory

arXiv.org Artificial IntelligenceNov-7-2025

Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that it-eratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria: relevance, importance, clarity, difficulty matching, answerability, to assess the system's ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

agent, artificial intelligence, natural language, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.5281/zenodo.15870246

2511.03958

Country: North America > United States > California > Los Angeles County > Los Angeles (0.29)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.75)

Add feedback

Enhancing Math Learning in an LMS Using AI-Driven Question Recommendations

Råmunddal, Justus

arXiv.org Artificial IntelligenceApr-22-2025

This paper presents an AI-driven approach to enhance math learning in a modern Learning Management System (LMS) by recommending similar math questions. Deep embeddings for math questions are generated using Meta's Llama-3.2-11B-Vision-Instruct model, and three recommendation methods-cosine similarity, Self-Organizing Maps (SOM), and Gaussian Mixture Models (GMM)-are applied to identify similar questions. User interaction data, including session durations, response times, and correctness, are used to evaluate the methods. Our findings suggest that while cosine similarity produces nearly identical question matches, SOM yields higher user satisfaction whereas GMM generally underperforms, indicating that introducing variety to a certain degree may enhance engagement and thereby potential learning outcomes until variety is no longer balanced reasonably, which our data about the implementations of all three methods demonstrate.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2504.14098

Country: North America > United States (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Technology > Educational Software (0.68)
Education > Curriculum > Subject-Specific Education (0.60)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

Nikolić, Kristina, Sun, Luze, Zhang, Jie, Tramèr, Florian

arXiv.org Artificial IntelligenceApr-16-2025

Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax

jailbreak tax, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.10694

Genre: Research Report > New Finding (0.69)

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety (0.68)
Government > Military (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

Automating Tools for Prompt Engineering

Communications of the ACMApr-10-2025, 16:23:33 GMT

Generative artificial intelligence (GAI) started making waves a few years ago with the release of systems such as ChatGPT and DALL-E. They are able to produce sophisticated and human-like text, code, or images after the models powering them are trained on large quantities of data. However, it soon became apparent that the specific phrasing of a question or statement input by a user, known as a prompt, had an impact on the quality of the resulting output. "It's a way of unlocking different capabilities from these models," says Andrei Muresanu, an AI researcher at Vector Institute in Toronto, Canada. "If you tell ChatGPT to pretend that it's a professor of mathematics, it will do better on math questions than if you just say, 'answer this question' or'pretend you're a student'." Coming up with prompts that steer a model towards a desired output has emerged as a relatively new profession, called prompt engineering, to help achieve more relevant and accurate results.

large language model, machine learning, natural language, (18 more...)

Communications of the ACM

Country:

North America > Canada > Ontario > Toronto (0.25)
North America > United States > California (0.05)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.05)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.88)

Add feedback

Self-supervised Analogical Learning using Language Models

Zhou, Ben, Jain, Sarthak, Zhang, Yi, Ning, Qiang, Wang, Shuai, Benajiba, Yassine, Roth, Dan

arXiv.org Artificial IntelligenceFeb-2-2025

Large language models have been shown to suffer from reasoning inconsistency issues. That is, they fail more in situations unfamiliar to the training data, even though exact or very similar reasoning paths exist in more common cases that they can successfully solve. Such observations motivate us to propose methods that encourage models to understand the high-level and abstract reasoning processes during training instead of only the final answer. This way, models can transfer the exact solution to similar cases, regardless of their relevance to the pre-training data distribution. In this work, we propose SAL, a self-supervised analogical learning framework. SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions from cases that they know how to solve to other rare cases in which they tend to fail more. We show that the resulting models after SAL learning outperform base language models on a wide range of reasoning benchmarks, such as StrategyQA, GSM8K, and HotpotQA, by 2% to 20%. At the same time, we show that our model is more generalizable and controllable through analytical studies.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.00996

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > United States > Arizona (0.04)
Asia > India > West Bengal > Kolkata (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling

Liu, Zihan, Chen, Yang, Shoeybi, Mohammad, Catanzaro, Bryan, Ping, Wei

arXiv.org Artificial IntelligenceDec-19-2024

In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We will release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath

large language model, machine learning, qwen2, (17 more...)

arXiv.org Artificial Intelligence

2412.15084

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)

Genre: Research Report (0.40)

Industry:

Education (0.46)
Health & Medicine > Consumer Health (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation

Zhu, Xunyu, Li, Jian, Ma, Can, Wang, Weiping

arXiv.org Artificial IntelligenceNov-21-2024

Large Language Models (LLMs) demonstrate exceptional reasoning capabilities, often achieving state-of-the-art performance in various tasks. However, their substantial computational and memory demands, due to billions of parameters, hinder deployment in resource-constrained environments. A promising solution is knowledge distillation, where LLMs transfer reasoning capabilities to Small Language Models (SLMs, $\le$ 1B parameters), enabling wider deployment on low-resource devices. Existing methods primarily focus on generating high-quality reasoning rationales for distillation datasets but often neglect the critical role of data quantity and quality. To address these challenges, we propose a Feedback-Driven Distillation (FDD) framework to enhance SLMs' mathematical reasoning capabilities. In the initialization stage, a distillation dataset is constructed by prompting LLMs to pair mathematical problems with corresponding reasoning rationales. We classify problems into easy and hard categories based on SLM performance. For easy problems, LLMs generate more complex variations, while for hard problems, new questions of similar complexity are synthesized. In addition, we propose a multi-round distillation paradigm to iteratively enrich the distillation datasets, thereby progressively improving the mathematical reasoning abilities of SLMs. Experimental results demonstrate that our method can make SLMs achieve SOTA mathematical reasoning performance.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2411.14698

Country:

North America > Canada > Ontario > Toronto (0.04)
Asia > Thailand > Bangkok > Bangkok (0.04)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Yu, Xiaodong, Zhou, Ben, Cheng, Hao, Roth, Dan

arXiv.org Artificial IntelligenceOct-24-2024

Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface model's uses of shortcuts and wrong reasoning while the later poses challenges in accommodating alternative solutions. In this work, we seek to use symbolic programs as a means for automated evaluation if a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT4-o. For those executable programs verified using the original input-output pairs, they are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT4-o to generate new questions using alternative input-output pairs based the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.19056

Country:

North America > United States > Pennsylvania (0.04)
North America > United States > Arizona (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
(4 more...)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.71)

Add feedback

Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy

Henkel, Owen, Horne-Robinson, Hannah, Dyshel, Maria, Ch, Nabil, Moreau-Pernet, Baptiste, Abood, Ralph

arXiv.org Artificial IntelligenceSep-26-2024

This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori, a learning platform used by students in several African countries and conducts two experiments to evaluate the use of large language models (LLM) for grading particularly challenging student answers. The AMMORE dataset enables various potential analyses and provides an important resource for researching student math acquisition in understudied, real-world, educational contexts. In experiment 1 we use a variety of LLM-driven approaches, including zero-shot, few-shot, and chain-of-thought prompting, to grade the 1% of student answers that a rule-based classifier fails to grade accurately. We find that the best-performing approach -- chain-of-thought prompting -- accurately scored 92% of these edge cases, effectively boosting the overall accuracy of the grading from 98.7% to 99.9%. In experiment 2, we aim to better understand the consequential validity of the improved grading accuracy, by passing grades generated by the best-performing LLM-based approach to a Bayesian Knowledge Tracing (BKT) model, which estimated student mastery of specific lessons. We find that relatively modest improvements in model accuracy at the individual question level can lead to significant changes in the estimation of student mastery. Where the rules-based classifier currently used to grade student, answers misclassified the mastery status of 6.9% of students across their completed lessons, using the LLM chain-of-thought approach this misclassification rate was reduced to 2.6% of students. Taken together, these findings suggest that LLMs could be a valuable tool for grading open-response questions in K-12 mathematics education, potentially enabling encouraging wider adoption of open-ended questions in formative assessment.

dataset, evaluation, student, (15 more...)

arXiv.org Artificial Intelligence

2409.17904

Country:

Africa > West Africa (0.04)
Africa > Ghana (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.48)
Instructional Material > Course Syllabus & Notes (0.46)

Industry:

Education > Assessment & Standards > Student Performance (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.46)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Cross-Language Assessment of Mathematical Capability of ChatGPT

Sathe, Gargi, Shamraj, Aneesh, Surve, Aditya, Patil, Nahush, Saxena, Kumkum

arXiv.org Artificial IntelligenceMay-18-2024

This paper presents an evaluation of the mathematical capability of ChatGPT across diverse languages like Hindi, Gujarati, and Marathi. ChatGPT, based on GPT-3.5 by OpenAI, has garnered significant attention for its natural language understanding and generation abilities. However, its performance in solving mathematical problems across multiple natural languages remains a comparatively unexplored area, especially in regional Indian languages. In this paper, we explore those capabilities as well as using chain-of-thought prompting to figure out if it increases the accuracy of responses as much as it does in the English language and provide insights into the current limitations.

chatgpt, gujarati, marathi, (15 more...)

arXiv.org Artificial Intelligence

2405.11264

Country:

Asia > India > Maharashtra > Mumbai (0.05)
Asia > India > Gujarat (0.04)

Genre: Research Report (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback