AITopics | statistical optimality

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

hard learning problem, statistical optimality, stochastic gradient descent, (7 more...)

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.43)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent

Neural Information Processing Systems, Aug-19-2025, 06:58:29 GMT

We apply gradient descent to a general class of objective functions for the reconstruction.

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent

Neural Information Processing Systems, Aug-19-2025, 06:58:25 GMT

We apply gradient descent to a general class of objective functions for the reconstruction.

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Reviews: Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Neural Information Processing Systems, Oct-7-2024, 05:11:12 GMT

This paper identifies and separates (kernel) linear least-squares regression problems wherein carrying out multiple passes of stochastic gradient descent (SGD) over a training set can yield better statistical error than only a single pass. This is relevant to the core of machine learning theory, and relates to a line of work published at NIPS, ICML, COLT, and similar conferences in the past several years about the statistical error of one-pass, many-pass, and ERM-based learning. The authors focus on regression problems captured, by assumption, by two parameters: alpha, which governs the exponent of a power-law eigenvalue decay, and r, which governs a transformation under which the Hilbert norm of the optimal predictor is bounded. They refer to problems where r <= (alpha - 1) / (2 * alpha) as "hard". The main result of the paper is to show that for these "hard" problems, multiple SGD passes either achieve (minimax) optimal rates of statistical estimation, or at least improve the rate relative to a single pass. The results are interesting and might address an unanswered core question in machine learning, and the mathematical presentation is clear, with assumptions upfront.
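As a small worked illustration of the two-parameter condition described in this review, the sketch below classifies an (alpha, r) pair. It assumes the comparison in the hardness condition is `r <= (alpha - 1) / (2 * alpha)`; the function names `hardness_threshold` and `is_hard` are hypothetical and do not come from the paper or the review.

```python
def hardness_threshold(alpha: float) -> float:
    """Return (alpha - 1) / (2 * alpha), the assumed boundary value for r."""
    if alpha <= 0:
        # Guard against division by zero; the positivity requirement is an
        # assumption of this sketch, not a statement from the review.
        raise ValueError("alpha must be positive")
    return (alpha - 1) / (2 * alpha)


def is_hard(alpha: float, r: float) -> bool:
    """True when r is at or below the threshold (assumed '<=' comparison)."""
    return r <= hardness_threshold(alpha)


# With alpha = 2 the threshold is (2 - 1) / (2 * 2) = 0.25.
print(hardness_threshold(2.0))
print(is_hard(2.0, 0.2))  # r below the threshold
print(is_hard(2.0, 0.5))  # r above the threshold
```

Under this reading, a faster eigenvalue decay (larger alpha) raises the threshold toward 1/2, so a wider range of r values counts as hard.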

hard learning problem, statistical optimality, stochastic gradient descent, (11 more...)

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Sobolev Acceleration and Statistical Optimality for Learning Elliptic Equations via Gradient Descent

Lu, Yiping, Blanchet, Jose, Ying, Lexing

arXiv.org Artificial Intelligence, Sep-19-2022

In this paper, we study the statistical limits in terms of Sobolev norms of gradient descent for solving inverse problem from randomly sampled noisy observations using a general class of objective functions. Our class of objective functions includes Sobolev training for kernel regression, Deep Ritz Methods (DRM), and Physics Informed Neural Networks (PINN) for solving elliptic partial differential equations (PDEs) as special cases. We consider a potentially infinite-dimensional parameterization of our model using a suitable Reproducing Kernel Hilbert Space and a continuous parameterization of problem hardness through the definition of kernel integral operators. We prove that gradient descent over this objective function can also achieve statistical optimality and the optimal number of passes over the data increases with sample size. Based on our theory, we explain an implicit acceleration of using a Sobolev norm as the objective function for training, inferring that the optimal number of epochs of DRM becomes larger than the number of PINN when both the data size and the hardness of tasks increase, although both DRM and PINN can achieve statistical optimality.

artificial intelligence, arxiv preprint arxiv, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2205.07331

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Pillaud-Vivien, Loucas, Rudi, Alessandro, Bach, Francis

Neural Information Processing Systems, Feb-14-2020, 20:11:59 GMT

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.
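To make "several passes over the data" concrete, here is a minimal sketch of averaged SGD for least squares in which the number of passes is a free parameter. It is a generic illustration, not the authors' algorithm: the fixed step size, the in-order cycling through samples, the running average of iterates, and the helper names (`make_regression_data`, `multipass_averaged_sgd`, `mean_squared_error`) are all assumptions. The low-dimensional synthetic data only shows the mechanics and does not reproduce the hard-problem regime the abstract analyses.

```python
import random


def make_regression_data(n, w_true, noise=0.1, seed=0):
    """Hypothetical helper: y = <w_true, x> + Gaussian noise, x standard normal."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_true, row)) + rng.gauss(0.0, noise)
          for row in xs]
    return xs, ys


def multipass_averaged_sgd(xs, ys, passes, step_size):
    """Run SGD on the squared loss for `passes` sweeps; return the averaged iterate."""
    w = [0.0] * len(xs[0])
    w_avg = list(w)
    t = 0
    for _ in range(passes):
        for x, y in zip(xs, ys):  # one pass visits every sample once
            residual = sum(wj * xj for wj, xj in zip(w, x)) - y
            # Gradient of 0.5 * residual**2 with respect to w is residual * x.
            w = [wj - step_size * residual * xj for wj, xj in zip(w, x)]
            t += 1
            # Running mean of all iterates seen so far.
            w_avg = [a + (wj - a) / t for a, wj in zip(w_avg, w)]
    return w_avg


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the linear predictor w."""
    return sum((sum(wj * xj for wj, xj in zip(w, x)) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)


xs, ys = make_regression_data(n=50, w_true=[1.0, -2.0, 0.5])
for passes in (1, 10):
    w = multipass_averaged_sgd(xs, ys, passes=passes, step_size=0.05)
    print(passes, mean_squared_error(w, xs, ys))
```

The `passes` argument is the quantity that, according to the abstract, should grow with the sample size on hard problems; in this sketch it is simply chosen by hand.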

hard learning problem, statistical optimality, stochastic gradient descent, (5 more...)

Neural Information Processing Systems

Genre: Research Report (0.44)

Industry: Education > Focused Education > Special Education (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Statistical Optimality of Interpolated Nearest Neighbor Algorithms

Xing, Yue, Song, Qifan, Cheng, Guang

arXiv.org Machine Learning, Nov-30-2018

In the era of deep learning, understanding over-fitting phenomenon becomes increasingly important. It is observed that carefully designed deep neural networks achieve small testing error even when the training error is close to zero. One possible explanation is that for many modern machine learning algorithms, over-fitting can greatly reduce the estimation bias, while not increasing the estimation variance too much. To illustrate the above idea, we prove that the proposed interpolated nearest neighbor algorithm achieves the minimax optimal rate in both regression and classification regimes, and observe that they are empirically better than the traditional $k$ nearest neighbor method in some cases.

artificial intelligence, assumption, machine learning, (15 more...)

arXiv.org Machine Learning

1810.02814

Country: North America > United States > Indiana > Tippecanoe County (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (1.00)

Parallelizing Stochastic Approximation Through Mini-Batching and Tail-Averaging

Jain, Prateek, Kakade, Sham M., Kidambi, Rahul, Netrapalli, Praneeth, Sidford, Aaron

arXiv.org Machine Learning, Oct-27-2016

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work sharply analyzes: (1) mini-batching, a method of averaging many samples of the gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tail-averaging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents the first tight non-asymptotic generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. These results are utilized in providing a highly parallelizable SGD algorithm that obtains the optimal statistical error rate with nearly the same number of serial updates as batch gradient descent, which improves significantly over existing SGD-style methods. Finally, this work sheds light on some fundamental differences in SGD's behavior when dealing with agnostic noise in the (non-realizable) least squares regression problem. In particular, the work shows that the stepsizes that ensure optimal statistical error rates for the agnostic case must be a function of the noise properties. The central analysis tools used by this paper are obtained through generalizing the operator view of averaged SGD, introduced by Defossez and Bach (2015) followed by developing a novel analysis in bounding these operators to characterize the generalization error. These techniques may be of broader interest in analyzing various computational aspects of stochastic approximation.
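The two averaging techniques named in this abstract can be sketched together for least squares: each step averages the gradient over a mini-batch, and the returned estimate averages only the final iterates. This is a simplified illustration under assumed choices (constant step size, sampling with replacement, a `tail_fraction` parameter, and the helper names below), not the algorithm or step-size rule analysed in the paper.

```python
import random


def make_regression_data(n, w_true, noise=0.0, seed=0):
    """Hypothetical helper: y = <w_true, x> + Gaussian noise, x standard normal."""
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    ys = [sum(w * x for w, x in zip(w_true, row)) + rng.gauss(0.0, noise)
          for row in xs]
    return xs, ys


def minibatch_tail_averaged_sgd(xs, ys, batch_size, steps, step_size,
                                tail_fraction=0.5, seed=0):
    """Mini-batch SGD on the squared loss; returns the mean of the tail iterates."""
    if steps < 1 or batch_size < 1 or not 0.0 < tail_fraction <= 1.0:
        raise ValueError("need steps >= 1, batch_size >= 1, 0 < tail_fraction <= 1")
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    w = [0.0] * d
    # Iterates with index >= tail_start are averaged; always keep at least one.
    tail_start = min(steps - 1, int(steps * (1.0 - tail_fraction)))
    tail_sum = [0.0] * d
    tail_count = 0
    for t in range(steps):
        # Mini-batching: average the per-sample gradients over the batch.
        grad = [0.0] * d
        for _ in range(batch_size):
            i = rng.randrange(n)
            residual = sum(wj * xj for wj, xj in zip(w, xs[i])) - ys[i]
            for j in range(d):
                grad[j] += residual * xs[i][j] / batch_size
        w = [wj - step_size * gj for wj, gj in zip(w, grad)]
        # Tail-averaging: accumulate only the final iterates.
        if t >= tail_start:
            tail_sum = [s + wj for s, wj in zip(tail_sum, w)]
            tail_count += 1
    return [s / tail_count for s in tail_sum]


xs, ys = make_regression_data(n=100, w_true=[2.0, -1.0])
print(minibatch_tail_averaged_sgd(xs, ys, batch_size=8, steps=300, step_size=0.1))
```

Because the gradients within one mini-batch are independent of each other, they could be computed in parallel, which is the parallelization opportunity the abstract refers to; this sketch computes them serially for clarity.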

artificial intelligence, generalization error, machine learning, (16 more...)

arXiv.org Machine Learning

1610.03774

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
