
Collaborating Authors

 Yang, Zitong


s1: Simple test-time scaling

arXiv.org Artificial Intelligence

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces, relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
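
Since the abstract spells out the mechanism, here is a minimal sketch of budget forcing as described above: cap the length of the thinking phase, and whenever the model tries to end early, append "Wait" so it keeps reasoning. The `generate_until(text, stop, max_tokens)` helper and the `</think>` delimiter are illustrative assumptions, not the paper's exact interface.

```python
def budget_forcing(prompt, generate_until, num_waits=2,
                   max_think_tokens=4096, end_of_thinking="</think>"):
    """Sketch of budget forcing. `generate_until(text, stop, max_tokens)` is
    assumed to return the text generated after `text`, truncated before `stop`
    is emitted or after `max_tokens` tokens."""
    thinking = ""
    for i in range(num_waits + 1):
        budget_left = max_think_tokens - len(thinking.split())  # crude token count
        if budget_left <= 0:
            break                               # budget spent: cut thinking short
        thinking += generate_until(prompt + thinking, stop=end_of_thinking,
                                   max_tokens=budget_left)
        if i < num_waits:
            thinking += " Wait"                 # suppress the stop and keep reasoning
    # Force the end-of-thinking delimiter and ask for the final answer.
    answer = generate_until(prompt + thinking + end_of_thinking,
                            stop=None, max_tokens=512)
    return thinking, answer
```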


Bellman Conformal Inference: Calibrating Prediction Intervals For Time Series

arXiv.org Artificial Intelligence

Uncertainty quantification for time series nowcasting and forecasting is crucial in many areas such as climate science, epidemiology, industrial engineering, and macroeconomics. Ideally, the forecaster would generate a prediction interval at each time period that is calibrated in the sense that the fraction of intervals covering the true outcomes is approximately equal to the target coverage level in the long run. Classical approaches for generating prediction intervals are mostly model-based [Box and Jenkins, 1976; Engle, 1982a; Stock and Watson, 2010; Brown, 1964; Jorda, 2005]. However, time series models are often mis-specified due to nonstationarity or changing environments. As a result, the model-based prediction intervals tend to be poorly calibrated (see, for instance, the gray curves in Figure 1). Moreover, many forecasters have upgraded their workflows by incorporating black-box machine learning algorithms [e.g., Taylor and Letham, 2018; Makridakis et al., 2018; Herzen et al., 2022], for which valid uncertainty quantification proves to be challenging.
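
To make the calibration criterion concrete, the sketch below wraps point forecasts in symmetric intervals whose half-width is adjusted online so that the long-run fraction of misses tracks the target miscoverage. This simple adaptive-width rule is only an illustration of the coverage criterion; it is not the Bellman Conformal Inference procedure proposed in the paper.

```python
import numpy as np

def online_calibrated_intervals(point_forecasts, outcomes, alpha=0.1, lr=0.05):
    """Symmetric intervals around point forecasts whose half-width is adjusted
    online so that long-run miscoverage approaches `alpha`. A simple
    illustrative rule, not the Bellman Conformal Inference algorithm."""
    radius = 1.0
    covered, intervals = [], []
    for f, y in zip(point_forecasts, outcomes):
        lo, hi = f - radius, f + radius
        intervals.append((lo, hi))
        miss = not (lo <= y <= hi)
        covered.append(not miss)
        # Widen after a miss, shrink slightly after a cover, so the long-run
        # miss rate is pushed toward alpha.
        radius *= np.exp(lr * (miss - alpha))
    print(f"empirical coverage: {np.mean(covered):.3f} (target {1 - alpha:.2f})")
    return intervals
```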


ResMem: Learn what you can and memorize the rest

arXiv.org Machine Learning

The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g., a neural network) by fitting the model's residuals with a k-nearest-neighbor-based regressor. The final prediction is then the sum of the original model's output and the fitted residual regressor's correction. By construction, ResMem can explicitly memorize the training labels, even when the base model has low capacity. We start by formulating a stylized linear regression problem and rigorously show that ResMem achieves a more favorable test risk than the base linear neural network. Then, we empirically show that ResMem consistently improves the test-set generalization of the original prediction model across standard vision and natural language processing benchmarks.
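
A minimal sketch of the ResMem idea for the regression case: fit a k-nearest-neighbor regressor to the base model's training residuals and add its output back at prediction time. The choice of k and the use of raw inputs as the neighbor space are illustrative assumptions, not the paper's exact configuration.

```python
from sklearn.neighbors import KNeighborsRegressor

class ResMem:
    """Residual memorization sketch: the base model learns what it can,
    a k-NN regressor memorizes the rest (the training residuals)."""

    def __init__(self, base_predict, k=5):
        self.base_predict = base_predict          # predict fn of any fitted model
        self.knn = KNeighborsRegressor(n_neighbors=k)

    def fit(self, X_train, y_train):
        residuals = y_train - self.base_predict(X_train)
        self.knn.fit(X_train, residuals)          # memorize what the base model missed
        return self

    def predict(self, X):
        return self.base_predict(X) + self.knn.predict(X)
```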


Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition

arXiv.org Machine Learning

Adversarial training enhances the robustness of deep neural networks at the cost of decreased accuracy on clean test samples [Goodfellow et al., 2014, Madry et al., 2017, Sinha et al., 2017]. Though the model can fit the training data perfectly under adversarial training, the generalization error on the clean test dataset increases compared with non-adversarially trained models. For example, in the rightmost panel of Figure 1(b), we can see that, even if an adversarially trained model achieves almost zero error on the clean training data (up to a certain level of perturbation ε), the error on the clean test data (the blue curve) keeps increasing with ε. Hence, to improve both the robustness and the accuracy of (adversarially trained) deep networks, it is crucial to understand the cause of this increased "generalization gap" between the errors on the (clean) training dataset and the (clean) test dataset. In this work, to better understand the generalization gap, we turn to a standard tool of statistical learning theory, the bias-variance decomposition [Markov, 1900, Lehmann, 1983, Casella and Berger, 1990, Hastie et al., 2009, Geman et al., 1992]. A large variance corresponds to instability of the model, whereas a large bias suggests that the model predicts poorly on average. Bias and variance provide more information about the generalization gap than the test error alone: we can better assess whether an explanation works by checking whether it predicts both the bias and the variance. How does adversarial training affect the bias and the variance?
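
As a concrete reference point, the expected squared test error decomposes into noise plus squared bias plus variance. One common way to estimate the last two terms is to retrain the procedure on independent subsamples of the training data, as in the hedged sketch below; the paper's exact estimation protocol may differ.

```python
import numpy as np

def bias_variance_estimate(train_fn, X, y, X_test, y_test, n_splits=5, seed=0):
    """Estimate squared bias and variance of a learning procedure.
    `train_fn(X, y)` is assumed to return a prediction function."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_splits):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        predict = train_fn(X[idx], y[idx])        # retrain on a random half
        preds.append(predict(X_test))
    preds = np.stack(preds)                       # (n_splits, n_test)
    mean_pred = preds.mean(axis=0)
    variance = preds.var(axis=0).mean()           # spread across training sets
    bias_sq = np.mean((mean_pred - y_test) ** 2)  # error of the averaged predictor
    return bias_sq, variance
```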


Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models

arXiv.org Machine Learning

Recent work showed that there could be a large gap between the classical uniform convergence bound and the actual test error of zero-training-error predictors (interpolators) such as deep neural networks. To better understand this gap, we study uniform convergence in the nonlinear random feature model and perform a precise theoretical analysis of how uniform convergence depends on the sample size and the number of parameters. We derive and prove analytical expressions for three quantities in this model: 1) classical uniform convergence over norm balls, 2) uniform convergence over interpolators in the norm ball (recently proposed by Zhou et al. (2020)), and 3) the risk of the minimum-norm interpolator. We show that, in the setting where the classical uniform convergence bound is vacuous (diverges to $\infty$), uniform convergence over the interpolators still gives a non-trivial bound on the test error of interpolating solutions. We also showcase a different setting where the classical uniform convergence bound is non-vacuous, but uniform convergence over interpolators can give an improved sample complexity guarantee. Our result provides a first exact comparison between the test errors and uniform convergence bounds for interpolators beyond simple linear models.
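
As a toy numerical companion to the quantities above, the sketch below builds a random feature model, computes the minimum-norm interpolator of noisy training data, and evaluates its test risk and norm. The dimensions, activation, and target function are illustrative assumptions; they do not reproduce the asymptotic setting analyzed in the paper.

```python
import numpy as np

def min_norm_rf_risk(n=200, d=30, N=800, noise=0.1, seed=0):
    """Test risk and norm of the minimum-l2-norm interpolator in a toy
    random feature model with relu features phi(x) = relu(W x / sqrt(d))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, d))                    # fixed random features
    phi = lambda X: np.maximum(X @ W.T / np.sqrt(d), 0.0)

    beta_star = rng.standard_normal(d) / np.sqrt(d)    # linear target
    def sample(m):
        X = rng.standard_normal((m, d))
        return X, X @ beta_star + noise * rng.standard_normal(m)

    X, y = sample(n)
    # Minimum-norm solution among interpolators (N > n, so pinv interpolates).
    a = np.linalg.pinv(phi(X)) @ y
    X_test, y_test = sample(5000)
    risk = np.mean((phi(X_test) @ a - y_test) ** 2)
    return risk, np.linalg.norm(a)
```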


Complete Dictionary Learning via $\ell^4$-Norm Maximization over the Orthogonal Group

arXiv.org Machine Learning

This paper considers the fundamental problem of learning a complete (orthogonal) dictionary from samples of sparsely generated signals. Most existing methods solve for the dictionary (and sparse representations) using heuristic algorithms, usually without theoretical guarantees for either optimality or complexity. Recent $\ell^1$-minimization based methods do provide such guarantees, but the associated algorithms recover the dictionary one column at a time. In this work, we propose a new formulation that maximizes the $\ell^4$-norm over the orthogonal group in order to learn the entire dictionary at once. We prove that under a random data model, with nearly minimum sample complexity, the global optima of the $\ell^4$-norm objective are very close to signed permutations of the ground truth. Inspired by this observation, we give a conceptually simple yet effective algorithm based on "matching, stretching, and projection" (MSP). The algorithm provably converges locally at a superlinear (cubic) rate, and the cost per iteration is merely an SVD. In addition to strong theoretical guarantees, experiments show that the new algorithm is significantly more efficient and effective than existing methods, including KSVD and $\ell^1$-based methods. Preliminary experimental results on real images clearly demonstrate the advantages of the learned dictionary over classical PCA bases.
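
Reading the MSP description literally, one iteration cubes the current codes elementwise, correlates them with the data, and projects the result back onto the orthogonal group via an SVD. The sketch below follows that reading; initialization, iteration count, and stopping are assumptions and may differ from the paper's exact algorithm.

```python
import numpy as np

def msp(Y, n_iter=50, seed=0):
    """Sketch of matching-stretching-projection for l4-norm maximization
    over the orthogonal group. Y: (n, p) matrix of observed signals."""
    n = Y.shape[0]
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal start
    for _ in range(n_iter):
        G = (A @ Y) ** 3 @ Y.T          # "match and stretch": gradient of ||AY||_4^4 / 4
        U, _, Vt = np.linalg.svd(G)     # "project": polar factor lies in O(n)
        A = U @ Vt
    return A                            # A.T approximates the dictionary up to sign/permutation
```

In a quick synthetic check, one would generate Y = D @ X with a random orthogonal D and sparse X, run msp(Y), and verify that A @ D is close to a signed permutation matrix.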