
Collaborating Authors

 Yang, Zitong


s1: Simple test-time scaling

arXiv.org Artificial Intelligence

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
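
Since the abstract spells out the mechanism, here is a minimal sketch of budget forcing as described above: cap the length of the thinking phase, and whenever the model tries to end early, append "Wait" so it keeps reasoning. The `generate_until(text, stop, max_tokens)` helper and the `</think>` delimiter are illustrative assumptions, not the paper's exact interface.

```python
def budget_forcing(prompt, generate_until, num_waits=2,
                   max_think_tokens=4096, end_of_thinking="</think>"):
    """Sketch of budget forcing. `generate_until(text, stop, max_tokens)` is
    assumed to return the text generated after `text`, truncated before `stop`
    is emitted or after `max_tokens` tokens."""
    thinking = ""
    for i in range(num_waits + 1):
        budget_left = max_think_tokens - len(thinking.split())  # crude token count
        if budget_left <= 0:
            break                               # budget spent: cut thinking short
        thinking += generate_until(prompt + thinking, stop=end_of_thinking,
                                   max_tokens=budget_left)
        if i < num_waits:
            thinking += " Wait"                 # suppress the stop and keep reasoning
    # Force the end-of-thinking delimiter and ask for the final answer.
    answer = generate_until(prompt + thinking + end_of_thinking,
                            stop=None, max_tokens=512)
    return thinking, answer
```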


Bellman Conformal Inference: Calibrating Prediction Intervals For Time Series

arXiv.org Artificial Intelligence

Uncertainty quantification for time series nowcasting and forecasting is crucial in many areas such as climate science, epidemiology, industrial engineering, and macroeconomics. Ideally, the forecaster would generate a prediction interval at each time period that is calibrated in the sense that the fraction of intervals covering the true outcomes is approximately equal to the target coverage level in the long run. Classical approaches for generating prediction intervals are mostly model-based [Box and Jenkins, 1976; Engle, 1982a; Stock and Watson, 2010; Brown, 1964; Jorda, 2005]. However, time series models are often mis-specified due to nonstationarity or changing environments. As a result, the model-based prediction intervals tend to be poorly calibrated (see, for instance, the gray curves in Figure 1). Moreover, many forecasters have upgraded their workflows by incorporating black-box machine learning algorithms [e.g., Taylor and Letham, 2018; Makridakis et al., 2018; Herzen et al., 2022], for which valid uncertainty quantification proves to be challenging.
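
To make the calibration criterion concrete, the sketch below wraps point forecasts in symmetric intervals whose half-width is adjusted online so that the long-run fraction of misses tracks the target miscoverage. This simple adaptive-width rule is only an illustration of the coverage criterion; it is not the Bellman Conformal Inference procedure proposed in the paper.

```python
import numpy as np

def online_calibrated_intervals(point_forecasts, outcomes, alpha=0.1, lr=0.05):
    """Symmetric intervals around point forecasts whose half-width is adjusted
    online so that long-run miscoverage approaches `alpha`. A simple
    illustrative rule, not the Bellman Conformal Inference algorithm."""
    radius = 1.0
    covered, intervals = [], []
    for f, y in zip(point_forecasts, outcomes):
        lo, hi = f - radius, f + radius
        intervals.append((lo, hi))
        miss = not (lo <= y <= hi)
        covered.append(not miss)
        # Widen after a miss, shrink slightly after a cover, so the long-run
        # miss rate is pushed toward alpha.
        radius *= np.exp(lr * (miss - alpha))
    print(f"empirical coverage: {np.mean(covered):.3f} (target {1 - alpha:.2f})")
    return intervals
```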


ResMem: Learn what you can and memorize the rest

arXiv.org Machine Learning

The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g., a neural network) by fitting the model's residuals with a k-nearest-neighbor-based regressor. The final prediction is then the sum of the original model's output and the fitted residual regressor's correction. By construction, ResMem can explicitly memorize the training labels, even when the base model has low capacity. We start by formulating a stylized linear regression problem and rigorously show that ResMem achieves a more favorable test risk than the base linear neural network. Then, we empirically show that ResMem consistently improves the test-set generalization of the original prediction model across standard vision and natural language processing benchmarks.
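
A minimal sketch of the ResMem idea for the regression case: fit a k-nearest-neighbor regressor to the base model's training residuals and add its output back at prediction time. The choice of k and the use of raw inputs as the neighbor space are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.neighbors import KNeighborsRegressor

class ResMem:
    """Residual memorization sketch: the base model learns what it can,
    a k-NN regressor memorizes the rest (the training residuals)."""

    def __init__(self, base_predict, k=5):
        self.base_predict = base_predict          # predict fn of any fitted model
        self.knn = KNeighborsRegressor(n_neighbors=k)

    def fit(self, X_train, y_train):
        residuals = y_train - self.base_predict(X_train)
        self.knn.fit(X_train, residuals)          # memorize what the base model missed
        return self

    def predict(self, X):
        return self.base_predict(X) + self.knn.predict(X)
```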


Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition

arXiv.org Machine Learning

Adversarial training enhances the robustness of deep neural networks at the cost of decreased accuracy on clean test samples [Goodfellow et al., 2014, Madry et al., 2017, Sinha et al., 2017]. Though the model can fit the training data perfectly under adversarial training, the generalization error on the clean test dataset increases compared with non-adversarially trained models. For example, in the rightmost panel of Figure 1(b), we can see that, even if an adversarially trained model achieves almost zero error on the clean training data (up to a certain level of perturbation ε), the error on the clean test data (the blue curve) keeps increasing with ε. Hence, to improve both the robustness and the accuracy of (adversarially trained) deep networks, it is crucial to understand the cause of this increased "generalization gap" between the errors on the (clean) training dataset and the (clean) test dataset. In this work, to better understand the generalization gap, we turn to a standard tool of statistical learning theory, the bias-variance decomposition [Markov, 1900, Lehmann, 1983, Casella and Berger, 1990, Hastie et al., 2009, Geman et al., 1992]. A large variance corresponds to instability of the model, whereas a large bias suggests that the model predicts poorly on average. Bias and variance provide more information about the generalization gap than the test error alone: we can better assess whether an explanation works by checking whether it predicts both the bias and the variance. How does adversarial training affect the bias and the variance?
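
As a concrete reference point, the expected squared test error decomposes into noise plus squared bias plus variance. One common way to estimate the last two terms is to retrain the procedure on independent subsamples of the training data, as in the hedged sketch below; the paper's exact estimation protocol may differ.

```python
import numpy as np

def bias_variance_estimate(train_fn, X, y, X_test, y_test, n_splits=5, seed=0):
    """Estimate squared bias and variance of a learning procedure.
    `train_fn(X, y)` is assumed to return a prediction function."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_splits):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        predict = train_fn(X[idx], y[idx])        # retrain on a random half
        preds.append(predict(X_test))
    preds = np.stack(preds)                       # (n_splits, n_test)
    mean_pred = preds.mean(axis=0)
    variance = preds.var(axis=0).mean()           # spread across training sets
    bias_sq = np.mean((mean_pred - y_test) ** 2)  # error of the averaged predictor
    return bias_sq, variance
```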


Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models

arXiv.org Machine Learning

Recent work showed that there could be a large gap between the classical uniform convergence bound and the actual test error of zero-training-error predictors (interpolators) such as deep neural networks. To better understand this gap, we study uniform convergence in the nonlinear random feature model and perform a precise theoretical analysis of how uniform convergence depends on the sample size and the number of parameters. We derive and prove analytical expressions for three quantities in this model: 1) classical uniform convergence over norm balls, 2) uniform convergence over interpolators in the norm ball (recently proposed by Zhou et al. (2020)), and 3) the risk of the minimum-norm interpolator. We show that, in the setting where the classical uniform convergence bound is vacuous (diverges to $\infty$), uniform convergence over the interpolators still gives a non-trivial bound on the test error of interpolating solutions. We also showcase a different setting where the classical uniform convergence bound is non-vacuous, but uniform convergence over interpolators can give an improved sample complexity guarantee. Our result provides a first exact comparison between the test errors and uniform convergence bounds for interpolators beyond simple linear models.
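
As a toy numerical companion to the quantities above, the sketch below builds a random feature model, computes the minimum-norm interpolator of noisy training data, and evaluates its test risk and norm. The dimensions, activation, and target function are illustrative assumptions; they do not reproduce the asymptotic setting analyzed in the paper.

```python
import numpy as np

def min_norm_rf_risk(n=200, d=30, N=800, noise=0.1, seed=0):
    """Test risk and norm of the minimum-l2-norm interpolator in a toy
    random feature model with relu features phi(x) = relu(W x / sqrt(d))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, d))                    # fixed random features
    phi = lambda X: np.maximum(X @ W.T / np.sqrt(d), 0.0)

    beta_star = rng.standard_normal(d) / np.sqrt(d)    # linear target
    def sample(m):
        X = rng.standard_normal((m, d))
        return X, X @ beta_star + noise * rng.standard_normal(m)

    X, y = sample(n)
    # Minimum-norm solution among interpolators (N > n, so pinv interpolates).
    a = np.linalg.pinv(phi(X)) @ y
    X_test, y_test = sample(5000)
    risk = np.mean((phi(X_test) @ a - y_test) ** 2)
    return risk, np.linalg.norm(a)
```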


Complete Dictionary Learning via $\ell^4$-Norm Maximization over the Orthogonal Group

arXiv.org Machine Learning

This paper considers the fundamental problem of learning a complete (orthogonal) dictionary from samples of sparsely generated signals. Most existing methods solve for the dictionary (and sparse representations) using heuristic algorithms, usually without theoretical guarantees for either optimality or complexity. Recent $\ell^1$-minimization based methods do provide such guarantees, but the associated algorithms recover the dictionary one column at a time. In this work, we propose a new formulation that maximizes the $\ell^4$-norm over the orthogonal group in order to learn the entire dictionary at once. We prove that under a random data model, with nearly minimum sample complexity, the global optima of the $\ell^4$-norm objective are very close to signed permutations of the ground truth. Inspired by this observation, we give a conceptually simple yet effective algorithm based on "matching, stretching, and projection" (MSP). The algorithm provably converges locally at a superlinear (cubic) rate, and the cost per iteration is merely an SVD. In addition to strong theoretical guarantees, experiments show that the new algorithm is significantly more efficient and effective than existing methods, including KSVD and $\ell^1$-based methods. Preliminary experimental results on real images clearly demonstrate the advantages of the learned dictionary over classical PCA bases.
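
Reading the MSP description literally, one iteration cubes the current codes elementwise, correlates them with the data, and projects the result back onto the orthogonal group via an SVD. The sketch below follows that reading; initialization, iteration count, and stopping are assumptions and may differ from the paper's exact algorithm.

```python
import numpy as np

def msp(Y, n_iter=50, seed=0):
    """Sketch of matching-stretching-projection for l4-norm maximization
    over the orthogonal group. Y: (n, p) matrix of observed signals."""
    n = Y.shape[0]
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal start
    for _ in range(n_iter):
        G = (A @ Y) ** 3 @ Y.T          # "match and stretch": gradient of ||AY||_4^4 / 4
        U, _, Vt = np.linalg.svd(G)     # "project": polar factor lies in O(n)
        A = U @ Vt
    return A                            # A.T approximates the dictionary up to sign/permutation
```

In a quick synthetic check, one would generate Y = D @ X with a random orthogonal D and sparse X, run msp(Y), and verify that A @ D is close to a signed permutation matrix.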