
Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

Neural Information Processing Systems

Today's best LLMs clearly possess some reasoning processes. This paper gives evidence that they also have metacognitive knowledge, including the ability to name the skills and procedures that apply to a given task. We explore this primarily in the context of math reasoning, developing a prompt-guided interaction procedure that gets a powerful LLM to assign sensible skill labels to math questions, and then having it perform semantic clustering to obtain coarser families of skill labels. These coarse skill labels look interpretable to humans. To validate that these skill labels are meaningful and relevant to the LLM's reasoning processes, we perform the following experiments.
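As a toy illustration of the clustering stage, the sketch below groups fine-grained skill labels into coarser families. It substitutes a cheap `difflib` string-similarity heuristic for the LLM-driven semantic clustering the paper actually uses; the example labels and the threshold are hypothetical.

```python
import difflib

def cluster_skill_labels(labels, threshold=0.6):
    """Greedy single-pass grouping of skill labels by string similarity.

    The paper has the LLM itself perform semantic clustering; difflib is
    used here only as a cheap, deterministic stand-in.
    """
    clusters = []  # each cluster is a list of labels; the first is its anchor
    for label in labels:
        for cluster in clusters:
            sim = difflib.SequenceMatcher(None, label, cluster[0]).ratio()
            if sim >= threshold:
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters

labels = [
    "solving linear equations",
    "solving linear equation systems",
    "modular arithmetic",
    "modular arithmetic tricks",
]
# Similar labels fall into the same coarse family.
print(cluster_skill_labels(labels))
```

Real semantic clustering (via the LLM or via embeddings) would merge paraphrases that share no surface text, which this heuristic cannot do; the structure of the pipeline is the same.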


ReCode: Updating Code API Knowledge with Reinforcement Learning

Wu, Haoze, Yao, Yunzhi, Yu, Wenhao, Zhang, Ningyu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation stems from reliance on outdated API knowledge in their training data and persists even with access to current documentation, impeding reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics how human programmers adapt to API changes. Specifically, we construct a dataset of approximately 2,000 entries to train LLMs to perform version migration based on updated information. We then introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
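The abstract does not specify the "modified string similarity metric", so the sketch below is only a plausible stand-in: a whitespace-normalized `difflib.SequenceMatcher` ratio used as a scalar reward, with a hypothetical pandas API migration as the example.

```python
import difflib

def code_similarity_reward(generated: str, reference: str) -> float:
    """Toy stand-in for ReCode's string-similarity reward.

    The paper's actual metric is not given in the abstract; this version
    normalizes whitespace and returns a SequenceMatcher ratio in [0, 1].
    """
    def norm(s: str) -> str:
        return " ".join(s.split())
    return difflib.SequenceMatcher(None, norm(generated), norm(reference)).ratio()

# Hypothetical migration example: pandas renamed `line_terminator` to
# `lineterminator`; a correctly migrated call earns the maximum reward.
reference = "df.to_csv('out.csv', lineterminator='\\n')"
stale     = "df.to_csv('out.csv', line_terminator='\\n')"

assert code_similarity_reward(reference, reference) == 1.0
assert code_similarity_reward(stale, reference) < 1.0
```

In a GRPO/DAPO setup this scalar would be the per-sample reward; a character-level similarity gives denser signal than an exact-match or pass/fail reward, which is presumably why a string metric was chosen.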


A Data and

Neural Information Processing Systems

We experimented with both removing and keeping the comments in the code in our training data. As shown in Table 6, keeping the comments gives better results overall. Detailed statistics of the resulting dataset can be found in Table 3: we give the size in gigabytes, the number of files and functions, and the number of tokens. We show two versions of the same Python function and their common tokenization.
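To make the comment-removal comparison concrete, here is a small sketch using Python's standard `tokenize` module; the example function is hypothetical, and this need not match the tokenizer actually used in the paper.

```python
import io
import tokenize

def token_strings(src: str):
    """Return the token strings of Python source, dropping layout-only tokens."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    skip = (tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER)
    return [t.string for t in toks if t.type not in skip]

# Two versions of the same (hypothetical) function: with and without a comment.
with_comment = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
without_comment = "def add(a, b):\n    return a + b\n"

tw = token_strings(with_comment)
to = token_strings(without_comment)

# The comment shows up as exactly one extra COMMENT token; all code tokens match.
assert "# sum two numbers" in tw
assert [t for t in tw if not t.startswith("#")] == to
```

This is the sense in which the two versions share a "common tokenization": stripping comments changes the token stream only by deleting COMMENT tokens, leaving the code tokens identical.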





Results for non-convex: Some reviewers mention the lack of results for non-convex settings as a weakness of the

Neural Information Processing Systems

We thank the reviewers for their comments. Paper structure: Reviewer 4 raised the concern that "the paper and the math seem unconnected at times". The proof for each result has the same structure; due to this similarity, the supporting proofs are bundled in Appendices A and B, and the proofs of the main theorems are bundled in Appendix C. We appreciate the positive feedback, and the pointer to confusing notation. Analyzing the average iterate rather than the last iterate is indeed done to produce a simple and generalizable analysis.


From Euler to Today: Universal Mathematical Fallibility. A Large-Scale Computational Analysis of Errors in ArXiv Papers

Rivin, Igor

arXiv.org Artificial Intelligence

We present the results of a large-scale computational analysis of mathematical papers from the ArXiv repository, demonstrating a comprehensive system that not only detects mathematical errors but provides complete referee reports with journal tier recommendations. Our automated analysis system processed over 37,000 papers across multiple mathematical categories, revealing significant error rates and quality distributions. Remarkably, the system identified errors in papers spanning three centuries of mathematics, including seven works by Leonhard Euler (1707-1783) in just 403 papers analyzed from the History category, as well as errors by Peter Gustav Lejeune Dirichlet (1805-1859) and contemporary Fields medalists. In Dynamical Systems (math.DS), we observed the highest error rate of 11.4% (2,347 errors in 20,666 papers), while Numerical Analysis (math.NA) showed 9.6% (2,271 errors in 23,761 papers). History and Overview (math.HO) exhibited 13.6% errors in preliminary analysis, including seven papers by Euler. In contrast, Geometric Topology (math.GT) showed 3.6% and Category Theory (math.CT) exhibited 6.1% (228 errors in 3,720 papers). Beyond error detection, the system evaluated papers for journal suitability, recommending 0.4% for top generalist journals, 15.5% for top field-specific journals, and categorizing the remainder across specialist venues. These findings demonstrate both the universality of mathematical errors across all eras and the feasibility of automated comprehensive mathematical peer review at scale. This work demonstrates that the methodology, while applied here to mathematics, is discipline-agnostic and could be readily extended to physics, computer science, and other fields represented in the ArXiv repository.
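The quoted percentages can be checked against the raw counts given in the abstract; a quick arithmetic sketch (the category keys are just shorthand for the counts above):

```python
# Recompute each reported error rate from its (errors, papers) counts
# and confirm it matches the percentage quoted in the abstract.
reported = {
    "math.DS": (2347, 20666, 11.4),
    "math.NA": (2271, 23761, 9.6),
    "math.CT": (228, 3720, 6.1),
}

for category, (errors, papers, quoted_pct) in reported.items():
    computed = round(100 * errors / papers, 1)
    assert computed == quoted_pct, (category, computed, quoted_pct)
```

All three quoted rates are consistent with their counts to one decimal place; the abstract gives no counts for math.GT or math.HO, so those rates cannot be checked the same way.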


A Appendix

Neural Information Processing Systems

A.1 TensorFlow Primitives Vocabulary

[Table: columns Name, TF Function, Argument Mapping, Input 1, Input 2, Constant, Dim Size; e.g. the ADD operation maps to tf.math.add.] "Name" is the name of the operation in our search. "TF Function" is the TensorFlow function that the name is mapped to when a DNA instruction is executed. "Argument Mapping" describes how the values in a DNA's argument set are mapped to the corresponding TensorFlow function arguments. TensorFlow graphs are built from DNA programs as described in Section 2 of the main text. The vocabulary for these relative dimensions is [1, 2, 4, 8, 12, 16, 24, 32, 48, 64]. This vocabulary was not tuned.
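As a loose illustration of the name-to-function vocabulary, here is a pure-Python sketch of such a dispatch table: `operator.add` stands in for `tf.math.add`, and the MUL entry and the argument-slot names are hypothetical, since only the ADD row survives in the fragment above.

```python
import operator

# Hypothetical miniature of the primitives vocabulary: each entry maps an
# instruction name to a function plus the argument slots it consumes.
VOCAB = {
    "ADD": (operator.add, ("input1", "input2")),  # stands in for tf.math.add
    "MUL": (operator.mul, ("input1", "input2")),  # hypothetical extra entry
}

def execute(name, args):
    """Look up an instruction name and apply its function to the mapped slots.

    `args` is a dict playing the role of a DNA instruction's argument set.
    """
    fn, slots = VOCAB[name]
    return fn(*(args[s] for s in slots))

assert execute("ADD", {"input1": 2, "input2": 3}) == 5
```

In the actual system the right-hand sides are TensorFlow ops and executing a full DNA program stitches these calls into a TensorFlow graph; the lookup-then-apply structure is the point of the sketch.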