Gradient Descent
Stability and Generalization Analysis of Gradient Methods for Shallow Neural Networks Yunwen Lei
While significant theoretical progress has been achieved, unveiling the generalization mystery of overparameterized neural networks still remains largely elusive. In this paper, we study the generalization behavior of shallow neural networks (SNNs) by leveraging the concept of algorithmic stability. We consider gradient descent (GD) and stochastic gradient descent (SGD) to train SNNs, for both of which we develop consistent excess risk bounds by balancing the optimization and generalization via early-stopping.
Accelerating SGD for Highly Ill-Conditioned Huge-Scale Online Matrix Completion
Gavin Zhang, University of Illinois at UrbanaโChampaign, jialun2@illinois.edu, "3026 Hong-Ming Chiu, University of Illinois at UrbanaโChampaign, hmchiu2@illinois.edu, "3026 Richard Y. Zhang, University of Illinois at UrbanaโChampaign, ryz@illinois.edu
The matrix completion problem seeks to recover a d d ground truth matrix of low rank r d from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with d so large that even the simplest full-dimension vector operations with O ( d) time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slow-down when the underlying ground truth is ill-conditioned; it requires at least O ( log(1 /)) iterations to get -close to ground truth matrix with condition number . In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to . For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to -accuracy in O (log(1 /)) iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with =1 . In our experiments, we observe a similar acceleration for item-item collaborative filtering on the MovieLens25M dataset via a pair-wise ranking loss, with 100 million training pairs and 10 million testing pairs.
Implicit Regularization or Implicit Conditioning Exact Risk Trajectories of in High Dimensions
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem.