Nati Srebro


Implicit Bias of Gradient Descent on Linear Convolutional Networks

Neural Information Processing Systems

Large-scale neural networks used in practice are highly over-parameterized, with far more trainable parameters than training examples. Consequently, optimization objectives for learning such high-capacity models have many global minima that fit the training data perfectly. However, minimizing the training loss with a specific optimization algorithm takes us not just to any global minimum, but to a special one, e.g., a global minimum that also minimizes some regularizer R(β). In over-parameterized models, especially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization by the optimization algorithm. Understanding the implicit bias, e.g., by characterizing R(β), is thus essential for understanding how and what the model learns.
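As an illustrative sketch (not from the paper): for an over-parameterized linear regression, gradient descent initialized at the origin converges to the minimum Euclidean-norm interpolant, i.e., the implicit regularizer is R(β) = ||β||²; the NumPy snippet below checks this numerically, with all problem sizes and variable names chosen purely for illustration.

```python
# Sketch (illustrative, not from the paper): gradient descent on an
# over-parameterized linear regression converges to the minimum L2-norm
# interpolating solution, i.e. the implicit regularizer is R(beta) = ||beta||^2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # fewer examples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta = np.zeros(d)                  # initialization at the origin
lr = 1e-2
for _ in range(20000):
    grad = X.T @ (X @ beta - y) / n
    beta -= lr * grad

# Minimum-norm interpolant via the pseudoinverse, for comparison.
beta_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ beta - y))
print("distance to min-norm solution:", np.linalg.norm(beta - beta_min_norm))
```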





Normalized Spectral Map Synchronization

Neural Information Processing Systems

Estimating maps among large collections of objects (e.g., dense correspondences across images and 3D shapes) is a fundamental problem across a wide range of domains. In this paper, we provide theoretical justification for spectral techniques applied to the map synchronization problem, which takes as input a collection of objects and noisy maps estimated between pairs of objects along a connected object graph, and outputs clean maps between all pairs of objects. We show that a simple normalized spectral method (NormSpecSync), which projects the blocks of the top eigenvectors of a data matrix onto the map space, exhibits surprisingly good behavior: NormSpecSync is much more efficient than state-of-the-art convex optimization techniques, while admitting similar exact recovery conditions. We demonstrate the usefulness of NormSpecSync on both synthetic and real datasets.
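As a minimal sketch of the generic spectral synchronization recipe the abstract describes (noiseless permutation maps for simplicity, and a plain eigendecomposition plus projection rather than the paper's exact normalization; all names and sizes are illustrative):

```python
# Sketch of the generic spectral map-synchronization recipe (illustrative only;
# NormSpecSync's exact normalization and noise model are described in the paper).
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_obj, k = 6, 5                     # number of objects, points per object

# Ground-truth absolute maps (permutations) and noiseless pairwise relative maps.
gt = [np.eye(k)[rng.permutation(k)] for _ in range(n_obj)]
blocks = [[gt[i] @ gt[j].T for j in range(n_obj)] for i in range(n_obj)]
M = np.block(blocks)                # (n_obj*k) x (n_obj*k) data matrix

# Top-k eigenvectors of the data matrix.
vals, vecs = np.linalg.eigh(M)
U = vecs[:, -k:]                    # leading k-dimensional eigenspace

# Project each k x k block onto the set of permutation matrices
# (linear assignment), relative to the first object's block.
anchor = U[:k]
recovered = []
for i in range(n_obj):
    C = U[i * k:(i + 1) * k] @ anchor.T
    r, c = linear_sum_assignment(-C)            # maximize correlation
    P = np.zeros((k, k))
    P[r, c] = 1.0
    recovered.append(P)

# Consistency of the recovered pairwise maps with the ground truth.
err = sum(np.abs(recovered[i] @ recovered[j].T - gt[i] @ gt[j].T).sum()
          for i in range(n_obj) for j in range(n_obj))
print("total pairwise map error:", err)
```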


Equality of Opportunity in Supervised Learning

Neural Information Processing Systems

We propose a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features. Assuming data about the predictor, target, and membership in the protected group are available, we show how to optimally adjust any learned predictor so as to remove discrimination according to our definition. Our framework also improves incentives by shifting the cost of poor classification from disadvantaged groups to the decision maker, who can respond by improving the classification accuracy. We encourage readers to consult the more complete manuscript on arXiv.
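As a simplified illustration of the post-processing idea (a toy example with group-specific thresholds chosen to equalize true positive rates; the paper's optimal derived predictor is more general and may randomize, and all data and names below are synthetic and illustrative):

```python
# Simplified illustration of equal-opportunity-style post-processing:
# pick a separate score threshold per protected group so that the
# true positive rates (recall on y = 1) approximately match.
# (The paper's optimal derived predictor is more general and may randomize.)
import numpy as np

rng = np.random.default_rng(0)
n = 10000
group = rng.integers(0, 2, n)                   # protected attribute A
y = rng.integers(0, 2, n)                       # true target Y
# A learned score that is informative but shifted across groups.
score = 0.5 * y + 0.2 * group + 0.5 * rng.random(n)

def tpr(th, g):
    mask = (group == g) & (y == 1)
    return np.mean(score[mask] >= th)

# Fix group 0's threshold, then search group 1's threshold to match its TPR.
th0 = 0.7
target = tpr(th0, 0)
candidates = np.linspace(score.min(), score.max(), 1000)
th1 = min(candidates, key=lambda t: abs(tpr(t, 1) - target))

print("TPR group 0:", tpr(th0, 0))
print("TPR group 1:", tpr(th1, 1))
```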


Path-Normalized Optimization of Recurrent Neural Networks with ReLU Activations

Neural Information Processing Systems

We investigate the parameter-space geometry of recurrent neural networks (RNNs), and develop an adaptation of the path-SGD optimization method, attuned to this geometry, that can learn plain RNNs with ReLU activations. On several datasets that require capturing long-term dependency structure, we show that path-SGD can significantly improve the trainability of ReLU RNNs compared to RNNs trained with SGD, even with various recently suggested initialization schemes.
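As a small sketch of the underlying path-norm quantity from earlier work on feedforward networks, which this paper adapts to recurrent networks (the computation below is for a two-layer ReLU network and is purely illustrative):

```python
# Sketch (illustrative): the squared path-norm of a feedforward ReLU network
# is the sum over all input-to-output paths of the product of squared weights.
# Path-SGD rescales each parameter's update using quantities derived from this
# norm; the paper adapts the idea to recurrent networks.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((50, 10))   # layer 1: 10 -> 50
W2 = rng.standard_normal((1, 50))    # layer 2: 50 -> 1

def squared_path_norm(weights):
    """Sum over paths of the product of squared weights, via a forward pass on W**2."""
    v = np.ones(weights[0].shape[1])
    for W in weights:
        v = (W ** 2) @ v
    return v.sum()

print("squared path-norm:", squared_path_norm([W1, W2]))
```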


Tight Complexity Bounds for Optimizing Composite Objectives

Neural Information Processing Systems

We provide tight upper and lower bounds on the complexity of minimizing the average of m convex functions using gradient and prox oracles of the component functions. We show a significant gap between the complexity of deterministic vs. randomized optimization. For smooth functions, we show that accelerated gradient descent (AGD) and an accelerated variant of SVRG are optimal in the deterministic and randomized settings, respectively, and that a gradient oracle is sufficient for the optimal rate. For non-smooth functions, having access to prox oracles reduces the complexity, and we present optimal methods based on smoothing that improve over methods using just gradient accesses.
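As an illustrative sketch (not from the paper): Nesterov-accelerated gradient descent applied to a smooth convex finite-sum objective using only a full-gradient oracle; the quadratic problem and constants below are purely illustrative.

```python
# Sketch (illustrative): Nesterov's accelerated gradient descent on a smooth
# convex finite-sum objective F(x) = (1/m) * sum_i f_i(x), using only a
# full-gradient oracle. Problem data and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 50
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def grad(x):                          # gradient of (1/2m) ||Ax - b||^2
    return A.T @ (A @ x - b) / m

L = np.linalg.norm(A, 2) ** 2 / m     # smoothness constant
x = y = np.zeros(d)
t = 1.0
for _ in range(500):
    x_next = y - grad(y) / L
    t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
    y = x_next + (t - 1) / t_next * (x_next - x)
    x, t = x_next, t_next

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print("suboptimality:",
      0.5 * np.mean((A @ x - b) ** 2) - 0.5 * np.mean((A @ x_star - b) ** 2))
```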


Efficient Globally Convergent Stochastic Optimization for Canonical Correlation Analysis

Neural Information Processing Systems

We study the stochastic optimization of canonical correlation analysis (CCA), whose objective is nonconvex and does not decouple over training samples. Although several stochastic gradient based optimization algorithms have recently been proposed to solve this problem, none of them comes with a global convergence guarantee. Inspired by the alternating least squares/power iterations formulation of CCA, and the shift-and-invert preconditioning method for PCA, we propose two globally convergent meta-algorithms for CCA, both of which transform the original problem into sequences of least squares problems that need only be solved approximately. We instantiate the meta-algorithms with state-of-the-art SGD methods and obtain time complexities that significantly improve upon those of previous work. Experimental results demonstrate their superior performance.
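As a sketch of the alternating least squares / power-iteration view of CCA that the meta-algorithms build on (here the inner least squares problems are solved exactly with lstsq rather than approximately with SGD, and the synthetic data and names are illustrative):

```python
# Sketch (illustrative): the alternating least squares / power-iteration view of
# CCA. Each outer iteration requires only solving least squares problems (solved
# exactly here via lstsq; the paper's algorithms solve them approximately).
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 2000, 20, 15
Z = rng.standard_normal((n, 5))                        # shared latent signal
X = Z @ rng.standard_normal((5, dx)) + 0.5 * rng.standard_normal((n, dx))
Y = Z @ rng.standard_normal((5, dy)) + 0.5 * rng.standard_normal((n, dy))
X -= X.mean(0)
Y -= Y.mean(0)

phi = rng.standard_normal(dx)
psi = rng.standard_normal(dy)
psi /= np.sqrt((Y @ psi) @ (Y @ psi) / n)              # psi' * Syy * psi = 1

for _ in range(50):
    phi, *_ = np.linalg.lstsq(X, Y @ psi, rcond=None)  # least squares step
    phi /= np.sqrt((X @ phi) @ (X @ phi) / n)
    psi, *_ = np.linalg.lstsq(Y, X @ phi, rcond=None)
    psi /= np.sqrt((Y @ psi) @ (Y @ psi) / n)

corr = (X @ phi) @ (Y @ psi) / n
print("estimated top canonical correlation:", corr)
```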


Implicit Regularization in Matrix Factorization

Neural Information Processing Systems

We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix X with gradient descent on a factorization of X. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.
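As an empirical sketch of the conjectured phenomenon (a toy matrix-sensing instance with a full-dimensional factorization, small step size, and near-zero initialization; problem sizes and names are illustrative, and this is a demonstration rather than the paper's experiments):

```python
# Sketch (illustrative): gradient descent on a full-dimensional factorization
# X = U U^T of an underdetermined matrix-sensing problem, with a small step size
# and near-zero initialization. Empirically the iterates approach a low
# nuclear-norm solution, here close to the planted low-rank matrix.
import numpy as np

rng = np.random.default_rng(0)
d, r, m = 10, 1, 30                           # 30 measurements < d(d+1)/2 = 55
W = rng.standard_normal((d, r))
X_star = W @ W.T                              # planted low-rank PSD matrix
A = rng.standard_normal((m, d, d))
A = (A + A.transpose(0, 2, 1)) / 2            # symmetric sensing matrices
b = np.einsum('mij,ij->m', A, X_star)

U = 1e-3 * rng.standard_normal((d, d))        # full-dimensional, tiny init
lr = 1e-3
for _ in range(50000):
    X = U @ U.T
    residual = np.einsum('mij,ij->m', A, X) - b
    grad_X = np.einsum('m,mij->ij', residual, A) / m
    U -= lr * 2 * grad_X @ U                  # chain rule through X = U U^T

X = U @ U.T
nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
print("relative recovery error:",
      np.linalg.norm(X - X_star) / np.linalg.norm(X_star))
print("nuclear norms (GD solution, planted):", nuc(X), nuc(X_star))
```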