Learning Curves for SGD on Structured Features

Bordelon, Blake, Pehlevan, Cengiz

arXiv.org Machine Learning 

Because the structure of realistic data is difficult to model, theoretical studies of generalization often either derive data-agnostic generalization bounds or study the typical performance of an algorithm on simple data distributions. The first set of theories derives bounds based on the complexity or capacity of the function class, and often struggles to explain the success of modern learning systems, which generalize well on real data yet are powerful enough to fit random noise [1, 2]. Rather than exploring data-independent worst-case performance, it is often useful to analyze how algorithms generalize typically, or on average, over a stipulated data distribution [3]. A typical assumption in this style of analysis is that the data distribution possesses a high degree of symmetry, for instance that it factorizes across input variables [4]. For example, spherical cow models treat data vectors as drawn from the isotropic Gaussian distribution or uniformly from the sphere, while Boolean hypercube models treat data as random binary vectors. Models built on such simplified data distributions have been employed in several classic and recent studies exploring the capacity of supervised learning algorithms and associative memory [5, 6], overfitting peaks and phase transitions in learning [7, 8, 9, 10, 11, 12], and neural network training dynamics [13].

Realistic datasets, however, are not distributed isotropically throughout the entire set of ambient dimensions; they often lie on low-dimensional structures. For example, MNIST and CIFAR-10 lie on surfaces with intrinsic dimension of 14 and 35 respectively [14].
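The contrast between isotropic data and data confined to a low-dimensional structure can be illustrated with a small numerical sketch. The example below is only an illustration under stated assumptions: the helper names (`sample_isotropic`, `sample_low_dim`, `participation_ratio`) and the sizes used are hypothetical, and the participation ratio of the covariance spectrum, (Σλ)² / Σλ², is a crude linear proxy for effective dimension, not the intrinsic-dimension estimator behind the figures cited from [14].

```python
import random


def sample_isotropic(n, d, rng):
    """Draw n points from an isotropic Gaussian in d dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def sample_low_dim(n, d, k, rng):
    """Draw n points from a k-dimensional Gaussian latent embedded
    linearly in d ambient dimensions (an assumed toy structure)."""
    basis = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]
        data.append([sum(z[a] * basis[a][j] for a in range(k)) for j in range(d)])
    return data


def participation_ratio(data):
    """Return (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    empirical covariance, computed as trace(C)^2 / ||C||_F^2 so that no
    eigendecomposition is needed."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    frob_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / frob_sq


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pr_iso = participation_ratio(sample_isotropic(500, 20, rng))
pr_low = participation_ratio(sample_low_dim(500, 20, 3, rng))
print(f"isotropic: {pr_iso:.2f}, low-dimensional: {pr_low:.2f}")
```

For the isotropic sample the ratio comes out close to the ambient dimension of 20, while for the embedded sample it cannot exceed the latent dimension of 3, even though both datasets live in the same 20-dimensional space.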
