AITopics | Education

Gradient-based methods are the primary approach for training neural networks. In recent years, research in learning theory has shown that neural networks can efficiently learn various data classes using empirical risk minimization (ERM) methods. In many real-world settings, data arrive sequentially in a non-stationary manner, requiring the learner to maintain performance on past tasks while acquiring new capabilities. In such cases, a learning model must be continually learnable, meaning it should retain previously acquired knowledge when trained on new tasks. On the other hand, various learning systems, including deep learning architectures, can be prone to catastrophic forgetting, that is, updating a model on new data causes a dramatic drop in performance on previously learned tasks [McCloskey and Cohen, 1989, Goodfellow et al., 2013]. The goal of continual (lifelong) learning is to develop models and methods that, even without retraining on old data, experience minimal forgetting when incorporating new information. Despite deep learning's ubiquity, characterizing the power and limitations of neural networks is still an ongoing research direction. While several recent works aim to understand the power of gradient descent (GD) for training neural networks with stylized data distributions, these works are still limited to single-task scenarios (for some examples see [Du et al., 2019, Bartlett et al., 2021, Abbe et al., 2022]).
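
The catastrophic forgetting described in this abstract can be illustrated with a toy sketch. This is not the paper's setting: the one-parameter linear model, the two conflicting tasks (y = 2x and y = -2x), and the names mse and train are all assumptions chosen only to make the effect visible in a few lines.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, lr=0.05, steps=50):
    """Plain gradient descent on the MSE, starting from weight w."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Two hypothetical tasks that disagree about the best weight.
xs = [1.0, 2.0, 3.0]
task_a = [(x, 2.0 * x) for x in xs]   # optimum at w = 2
task_b = [(x, -2.0 * x) for x in xs]  # optimum at w = -2

w = train(0.0, task_a)
loss_a_before = mse(w, task_a)  # near zero after fitting task A

w = train(w, task_b)            # sequential update, no access to task A data
loss_a_after = mse(w, task_a)   # large: the task A solution was overwritten

print(loss_a_before, loss_a_after)
```

Because the model has a single weight, fitting the second task necessarily overwrites the first. Real networks have far more capacity, so the sketch shows only the failure mode, not how severe it is in practice.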

continual learning, neural network, training loss, (11 more...)

arXiv.org Machine Learning

2510.05573

Country: North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report > New Finding (0.67)

Industry: Education > Educational Setting (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization

O'Brien, Dayyán, Haddow, Barry, Allaway, Emily, Chen, Pinzhen

arXiv.org Artificial Intelligence Oct-8-2025

Conducting contamination-free evaluation of mathematical capabilities can be difficult for two reasons: models may memorize a test set once it is made public, and current mathematical benchmarks are prone to overfitting due to having limited diversity of symbols and rules, coupled with closed-ended answers. This paper proposes a method to leverage these shortcomings as useful features to construct a dynamic, counterfactual benchmark, which can be used to both reveal overfitting and measure true reasoning. We demonstrate this via MatheMagic, which generates math test instances with the interpretations of numbers and operators altered, yet has automatically verifiable answers. Test instances are randomly seeded and constructed at test time to evaluate a model's induction or deduction capability, offering stability, extensibility, comparability, and robustness to overfitting. Our experiments find that models solve deduction more easily than induction, but they revert to standard math. Further analysis reveals that math-adapted models fail to exhibit a general "skill" of reasoning, and fine-tuning on induction tasks generalizes poorly.
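
As a rough illustration of the idea in this abstract, the sketch below builds a seeded test item in which digit symbols and operator symbols are reassigned, and the answer remains checkable by a program. The remapping scheme and the names make_instance and solve are hypothetical; the abstract does not specify MatheMagic's actual generation rules.

```python
import random


def make_instance(seed):
    """Build one counterfactual arithmetic item from a seed (hypothetical scheme)."""
    rng = random.Random(seed)

    # Reassign meanings: symbol d now stands for the value digit_map[d].
    digits = list(range(10))
    shuffled_digits = digits[:]
    rng.shuffle(shuffled_digits)
    digit_map = dict(zip(digits, shuffled_digits))

    # Reassign operators: written symbol op now behaves like op_map[op].
    ops = ["+", "-", "*"]
    shuffled_ops = ops[:]
    rng.shuffle(shuffled_ops)
    op_map = dict(zip(ops, shuffled_ops))

    question = (rng.randrange(10), rng.choice(ops), rng.randrange(10))
    return {"digit_map": digit_map, "op_map": op_map, "question": question}


def solve(instance):
    """Compute the reference answer under the altered interpretation."""
    a, op, b = instance["question"]
    x = instance["digit_map"][a]
    y = instance["digit_map"][b]
    actual_op = instance["op_map"][op]
    if actual_op == "+":
        return x + y
    if actual_op == "-":
        return x - y
    return x * y


instance = make_instance(seed=42)
print(instance["question"], "->", solve(instance))
```

Seeding a private random.Random instance means the same seed always reproduces the same item, which would support comparability across runs, while a fresh seed yields an item that could not have been memorized in advance.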

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2510.05962

Genre: Research Report (0.50)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.32)


From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning

Huang, Yuzhen, Zeng, Weihao, Zeng, Xingshan, Zhu, Qi, He, Junxian

arXiv.org Artificial Intelligence Oct-8-2025

Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct, particularly after fine-tuning. This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique challenges inherent to both rule-based and model-based verifiers and provide insights toward developing more accurate and robust reward systems for reinforcement learning.
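
The false-negative problem described in this abstract can be seen in a minimal sketch. The two verifiers below are illustrative assumptions, not the open-source verifiers the paper evaluates: exact_match compares strings, while numeric_match parses both answers as rational numbers before comparing.

```python
from fractions import Fraction


def exact_match(pred, gold):
    """Naive rule: answers count as equal only if the strings are identical."""
    return pred.strip() == gold.strip()


def numeric_match(pred, gold):
    """Compare as exact rationals when both parse; otherwise fall back to strings."""
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return exact_match(pred, gold)


# "0.5" and "1/2" are the same number written in two formats.
print(exact_match("0.5", "1/2"))    # False: a false negative
print(numeric_match("0.5", "1/2"))  # True
```

Even the second rule covers only plain decimals and fractions. Answers such as LaTeX expressions, units, or symbolic forms would still need further normalization, which hints at why rule-based checking is hard to make complete.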

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.22203

Genre: Research Report > New Finding (1.00)

Industry: Education (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
