Denain, Jean-Stanislas
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Glazer, Elliot, Erdil, Ege, Besiroglu, Tamay, Chicharro, Diego, Chen, Evan, Gunning, Alex, Olsson, Caroline Falkman, Denain, Jean-Stanislas, Ho, Anson, Santos, Emily de Oliveira, Järviniemi, Olli, Barnett, Matthew, Sandler, Robert, Vrzala, Matej, Sevilla, Jaime, Ren, Qiuyu, Pratt, Elizabeth, Levine, Lionel, Barkley, Grant, Stewart, Natalie, Grechuk, Bogdan, Grechuk, Tetiana, Enugandla, Shreepranav Varma, Wildon, Mark
Recent AI systems have demonstrated remarkable proficiency in tackling challenging mathematical tasks, from achieving olympiad-level performance in geometry (Trinh et al. 2024) to improving upon existing research results in combinatorics (Romera-Paredes et al. 2024). However, existing benchmarks face two key limitations.

Saturation of existing benchmarks. Current standard mathematics benchmarks such as the MATH dataset (Hendrycks, Burns, Kadavath, et al. 2021) and GSM8K (Cobbe et al. 2021) primarily assess competency at the high-school and early undergraduate level. As state-of-the-art models achieve near-perfect performance on these benchmarks, we lack rigorous ways to evaluate their capabilities in advanced mathematical domains that require deeper theoretical understanding, creative insight, and specialized expertise. Furthermore, to assess AI's potential contributions to mathematics research, we require benchmarks that better reflect the challenges faced by working mathematicians.

Benchmark contamination in training data. A significant challenge in evaluating large language models (LLMs) is data contamination: the inadvertent inclusion of benchmark problems in training data.
AI capabilities can be significantly improved without expensive retraining
Davidson, Tom, Denain, Jean-Stanislas, Villalobos, Pablo, Bas, Guillem
State-of-the-art AI systems can be significantly improved without expensive retraining via "post-training enhancements": techniques applied after initial training, such as fine-tuning the system to use a web browser. We review recent post-training enhancements, categorizing them into five types: tool-use, prompting methods, scaffolding, solution selection, and data generation. Different enhancements improve performance on different tasks, making it hard to compare their significance. So we translate improvements from different enhancements into a common currency, the compute-equivalent gain: how much additional training compute would be needed to improve performance by the same amount as the enhancement. Our non-experimental work shows that post-training enhancements have significant benefits: most surveyed enhancements improve benchmark performance by more than the equivalent of a 5x increase in training compute, some by more than 20x. Post-training enhancements are relatively cheap to develop: fine-tuning costs are typically <1% of the original training cost. Governing the development of capable post-training enhancements may be challenging because frontier models could be enhanced by a wide range of actors.
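To make the compute-equivalent gain concrete, here is a minimal sketch assuming a toy log-linear scaling law between training compute and benchmark accuracy; the function names and coefficients are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def accuracy_from_compute(compute_flop, a=0.05, b=-0.9):
    """Toy scaling law: accuracy rises log-linearly with training compute.
    The coefficients a and b are made up for illustration."""
    return a * np.log10(compute_flop) + b

def compute_equivalent_gain(base_compute, enhanced_accuracy, a=0.05, b=-0.9):
    """How much extra training compute (as a multiplier) would match the
    enhancement's accuracy gain, under the toy scaling law above."""
    # Invert the scaling law: compute needed to reach enhanced_accuracy.
    needed_compute = 10 ** ((enhanced_accuracy - b) / a)
    return needed_compute / base_compute

base = 1e24                                # training compute of the base model (FLOP)
base_acc = accuracy_from_compute(base)     # accuracy without the enhancement
enhanced_acc = base_acc + 0.04             # accuracy after, e.g., adding tool use
print(f"CEG: {compute_equivalent_gain(base, enhanced_acc):.1f}x")
```

Under these toy numbers, a 4-percentage-point accuracy gain corresponds to roughly a 6x compute-equivalent gain, in line with the paper's observation that many enhancements exceed the 5x mark.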
Overthinking the Truth: Understanding how Language Models Process False Demonstrations
Halawi, Danny, Denain, Jean-Stanislas, Steinhardt, Jacob
Modern language models can imitate complex patterns through few-shot learning, enabling them to complete challenging tasks without fine-tuning. However, imitation can also lead models to reproduce inaccuracies or harmful content if present in the context. We study harmful imitation through the lens of a model's internal representations, and identify two related phenomena: overthinking and false induction heads. The first phenomenon, overthinking, appears when we decode predictions from intermediate layers, given correct vs. incorrect few-shot demonstrations. At early layers, both demonstrations induce similar model behavior, but the behavior diverges sharply at some "critical layer", after which the accuracy given incorrect demonstrations progressively decreases. The second phenomenon, false induction heads, is a possible mechanistic cause of overthinking: these are heads in late layers that attend to and copy false information from previous demonstrations, and whose ablation reduces overthinking. Beyond scientific understanding, our results suggest that studying intermediate model computations could be a promising avenue for understanding and guarding against harmful model behaviors.
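The intermediate-layer decoding described above can be reproduced in spirit with a logit-lens-style probe; the sketch below uses GPT-2 via Hugging Face transformers. The prompt and the idea of flipping labels to form incorrect demonstrations are illustrative, not the paper's exact experimental setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Few-shot sentiment prompt; flipping the labels would give "incorrect
# demonstrations" whose layer-by-layer effect can then be compared.
prompt = "great -> positive\nawful -> negative\nterrible ->"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Decode the next-token prediction at every layer by applying the final
# layer norm and the unembedding matrix to each intermediate hidden state.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d}: {tok.decode(logits.argmax())!r}")
```

Running this with correct versus flipped labels and tracking where the decoded predictions diverge is one simple way to locate a "critical layer" of the kind the paper describes.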
Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior
Denain, Jean-Stanislas, Steinhardt, Jacob
Model visualizations provide information that outputs alone might miss. But can we trust that model visualizations reflect model behavior? For instance, can they diagnose abnormal behavior such as planted backdoors or overregularization? To evaluate visualization methods, we test whether they assign different visualizations to anomalously trained models and normal models. We find that while existing methods can detect models with starkly anomalous behavior, they struggle to identify more subtle anomalies. Moreover, they often fail to recognize the inputs that induce anomalous behavior, e.g., images containing a spurious cue. These results reveal blind spots and limitations of some popular model visualizations. By introducing a novel evaluation framework for visualizations, our work paves the way for developing more reliable model transparency methods.
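The evaluation logic can be illustrated with a toy check: do a method's visualizations for an anomalous model differ measurably from those for normal models? The distance measure, threshold, and synthetic saliency maps below are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def visualization_distance(vis_a, vis_b):
    """1 - Pearson correlation between two flattened saliency maps."""
    return 1.0 - np.corrcoef(vis_a.ravel(), vis_b.ravel())[0, 1]

def detects_anomaly(normal_maps, candidate_maps, threshold=0.3):
    """Flag the candidate model as anomalous if its visualizations sit far
    from the corresponding visualizations of normally trained models."""
    dists = [visualization_distance(n, c)
             for n, c in zip(normal_maps, candidate_maps)]
    return float(np.mean(dists)) > threshold

# Synthetic saliency maps standing in for a real visualization method.
rng = np.random.default_rng(0)
normal = [rng.random((32, 32)) for _ in range(10)]
backdoored = []
for _ in range(10):
    m = rng.random((32, 32)) * 0.1
    m[:4, :4] = 1.0            # saliency collapses onto a trigger patch
    backdoored.append(m)

print(detects_anomaly(normal, normal))      # False: visualizations match
print(detects_anomaly(normal, backdoored))  # True: visualizations diverge
```

The paper's finding, in these terms, is that real visualization methods pass this check only when the anomaly is stark; for subtler anomalies the distance stays small.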
MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference
Thin, Achille, Kotelevskii, Nikita, Denain, Jean-Stanislas, Grinsztajn, Leo, Durmus, Alain, Panov, Maxim, Moulines, Eric
In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to MetFlow, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically increasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expensive reverse kernels. Extensive numerical experiments show clear computational and performance improvements over state-of-the-art methods.
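As a toy illustration of flow-based proposals in MCMC, the sketch below runs an independence Metropolis-Hastings sampler whose proposal is a fixed affine "flow". MetFlow itself uses learned flows and a reparametrizable construction, so this is only a conceptual stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: unnormalized log-density of a 1-D mixture of two Gaussians.
def log_target(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

# "Flow": an affine map applied to a standard normal base sample.
# A trained normalizing flow would replace these fixed parameters.
scale, shift = 2.5, 0.0

def flow_sample():
    return scale * rng.standard_normal() + shift

def flow_logpdf(x):
    z = (x - shift) / scale                   # inverse of the flow
    return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi) - np.log(scale)

# Independence Metropolis-Hastings with the flow as proposal.
x, samples = 0.0, []
for _ in range(5000):
    x_new = flow_sample()
    log_alpha = (log_target(x_new) + flow_logpdf(x)
                 - log_target(x) - flow_logpdf(x_new))
    if np.log(rng.random()) < log_alpha:
        x = x_new
    samples.append(x)

print(f"sample mean ~ {np.mean(samples):.2f} (target is symmetric, ~0)")
```

The accept/reject step here involves a hard comparison, which is exactly what breaks the reparametrization trick; avoiding that obstacle without resorting to reverse kernels is the problem the paper's construction addresses.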