Lacoste-Julien, Simon
A Closer Look at the Optimization Landscapes of Generative Adversarial Networks
Berard, Hugo, Gidel, Gauthier, Almahairi, Amjad, Vincent, Pascal, Lacoste-Julien, Simon
Generative adversarial networks (GANs) have been very successful at generative modeling; however, they remain relatively hard to optimize compared to standard deep neural networks. In this paper, we try to gain insight into the optimization of GANs by looking at the game vector field resulting from the concatenation of the gradients of both players. Based on this point of view, we propose visualization techniques that allow us to make the following empirical observations. First, GAN training suffers from rotational behavior around locally stable stationary points, which, as we show, corresponds to the presence of imaginary components in the eigenvalues of the Jacobian of the game. Second, GAN training seems to converge to a stable stationary point that is a saddle point of the generator loss, not a minimum, while still achieving excellent performance. This counter-intuitive yet persistent observation raises the question of whether we actually need a Nash equilibrium to get good performance in GANs.
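The link between rotation and imaginary eigenvalues can be illustrated on a toy bilinear game (an example chosen here for illustration, not taken from the paper):

```python
import numpy as np

# Toy bilinear game min_x max_y f(x, y) = x * y. The game vector field
# concatenates both players' gradients: v(x, y) = (df/dx, -df/dy) = (y, -x).
def game_vector_field(z):
    return np.array([z[1], -z[0]])

# Jacobian of v; its eigenvalues are +/- i, i.e. purely imaginary --
# exactly the condition associated with rotational training dynamics.
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
eigvals = np.linalg.eigvals(J)

# Simultaneous gradient steps on this field rotate around the equilibrium
# at the origin and spiral away from it instead of converging.
z = np.array([1.0, 0.0])
eta = 0.1
start_norm = np.linalg.norm(z)
for _ in range(100):
    z = z - eta * game_vector_field(z)
end_norm = np.linalg.norm(z)
```

Each step multiplies the distance to the equilibrium by sqrt(1 + eta^2) > 1, so plain simultaneous gradient descent never settles at the locally stable stationary point.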
Gradient-Based Neural DAG Learning
Lachapelle, Sébastien, Brouillard, Philippe, Deleu, Tristan, Lacoste-Julien, Simon
We propose a novel score-based approach to learning a directed acyclic graph (DAG) from observational data. We adapt a recently proposed continuous constrained optimization formulation to allow for nonlinear relationships between variables using neural networks. This extension allows us to model complex interactions while searching more globally than greedy approaches. In addition to comparing our method to existing continuous optimization methods, we provide previously missing empirical comparisons with nonlinear greedy search methods. On both synthetic and real-world data sets, this new method outperforms current continuous methods on most tasks while being competitive with existing greedy search methods on metrics important for causal inference.
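The continuous constrained formulation the method builds on characterizes acyclicity with a smooth penalty, h(W) = tr(exp(W ∘ W)) − d, which is zero exactly when the weighted adjacency matrix W encodes a DAG (this is the constraint of the NOTEARS line of work by Zheng et al., 2018; the sketch below is illustrative and omits the paper's neural-network extension):

```python
import numpy as np
from scipy.linalg import expm

# Smooth acyclicity constraint: h(W) = tr(exp(W * W)) - d, where * is the
# elementwise product. h(W) = 0 iff W is the adjacency matrix of a DAG.
def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d  # elementwise square keeps entries >= 0

# A 3-node DAG (edges 0->1 and 1->2): the constraint is ~0.
W_dag = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])

# Adding the edge 2->0 creates a cycle: the constraint becomes positive.
W_cyclic = W_dag.copy()
W_cyclic[2, 0] = 1.
```

Because h is differentiable, the combinatorial search over graph structures becomes a continuous optimization problem that gradient-based solvers can handle.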
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
Vaswani, Sharan, Mishkin, Aaron, Laradji, Issam, Schmidt, Mark, Gidel, Gauthier, Lacoste-Julien, Simon
Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities, and SGD's practical performance heavily relies on the choice of the step-size. We propose to use line-search methods to automatically set the step-size when training models that can interpolate the data. We prove that SGD with the classic Armijo line-search attains the fast convergence rates of full-batch gradient descent in convex and strongly-convex settings. We also show that, under additional assumptions, SGD with a modified line-search can attain a fast rate of convergence for non-convex functions. Furthermore, we show that a stochastic extra-gradient method with a Lipschitz line-search attains fast convergence rates for an important class of non-convex functions and saddle-point problems satisfying interpolation. We then give heuristics for using larger step-sizes and acceleration with our line-search techniques. We compare the proposed algorithms against numerous optimization methods for standard classification tasks using both kernel methods and deep networks. The proposed methods are robust and result in competitive performance across all models and datasets.
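The core idea can be sketched on a toy interpolating least-squares problem (a minimal illustration; the data, backtracking schedule, and hyperparameters below are assumptions, not the paper's exact settings):

```python
import numpy as np

# Toy interpolating regression: noiseless targets, so a zero-loss model exists.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def minibatch_loss_grad(w, idx):
    r = X[idx] @ w - y[idx]
    return 0.5 * np.mean(r ** 2), X[idx].T @ r / len(idx)

w = np.zeros(5)
eta_max, c = 10.0, 0.5  # reset step-size and Armijo constant (assumed values)
for _ in range(400):
    idx = rng.choice(100, size=20, replace=False)
    f, g = minibatch_loss_grad(w, idx)
    eta = eta_max
    # Backtrack until the Armijo condition holds on the *same* mini-batch:
    # f(w - eta g) <= f(w) - c * eta * ||g||^2.
    while minibatch_loss_grad(w - eta * g, idx)[0] > f - c * eta * g @ g:
        eta *= 0.5
    w = w - eta * g

final_loss = 0.5 * np.mean((X @ w - y) ** 2)
```

The step-size is chosen automatically each iteration from the current mini-batch, with no problem-dependent constants supplied by the user.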
Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
Gidel, Gauthier, Bach, Francis, Lacoste-Julien, Simon
When optimizing over-parameterized models, such as deep neural networks, many parameter settings achieve zero training error. However, they lead to different values of the test error and thus have distinct generalization properties. More specifically, Neyshabur [2017, Part II] argues that the choice of the optimization algorithm (and its respective hyperparameters) provides an implicit regularization with respect to its geometry: it biases the training toward a particular minimizer of the objective. In this work, we use the same setting as Saxe et al. [2018]: a regression problem with least-squares loss on a multidimensional output. Our prediction is made either by a linear model or by a two-layer linear neural network [Saxe et al., 2018]. Our goal is to extend their work on the continuous gradient dynamics in order to understand the behavior of the discrete dynamics induced by these two models. We show that, with a vanishing initialization and a small enough step-size, the gradient dynamics of the two-layer linear neural network sequentially learns components that can be ranked according to a hierarchical structure, whereas the gradient dynamics of the linear model learns all components at the same time, missing this notion of hierarchy between components. The path followed by the two-layer formulation actually corresponds to successively solving the initial regression problem under a growing low-rank constraint, which is also known as reduced-rank regression [Izenman, 1975]. Note that this notion of the path followed by the dynamics of a whole network is different from the notion of path introduced by Neyshabur et al. [2015a].
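Reduced-rank regression itself can be sketched in a few lines (a toy construction with dimensions chosen for illustration; with whitened, i.e. orthonormal, inputs the rank-k solution is simply the truncated SVD of the unconstrained least-squares map):

```python
import numpy as np

# Toy multi-output regression with orthonormal inputs.
rng = np.random.default_rng(0)
n, d, p = 200, 6, 4
X, _ = np.linalg.qr(rng.normal(size=(n, d)))  # orthonormal columns
W_true = rng.normal(size=(d, p))
Y = X @ W_true

W_full = np.linalg.lstsq(X, Y, rcond=None)[0]  # unconstrained minimizer
U, s, Vt = np.linalg.svd(W_full, full_matrices=False)

def rank_k_solution(k):
    # Best rank-k approximation of W_full (Eckart-Young): this is the
    # reduced-rank regression solution when X has orthonormal columns.
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Residuals shrink monotonically as the rank constraint is relaxed,
# mirroring the sequence of problems the two-layer dynamics passes through.
residuals = [np.linalg.norm(Y - X @ rank_k_solution(k))
             for k in range(1, min(d, p) + 1)]
```

Each increase of the allowed rank adds one more singular component, matching the hierarchical, one-component-at-a-time picture described above.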
Reducing Noise in GAN Training with Variance Reduced Extragradient
Chavdarova, Tatjana, Gidel, Gauthier, Fleuret, François, Lacoste-Julien, Simon
Using large mini-batches when training generative adversarial networks (GANs) has recently been shown to significantly improve the quality of the generated samples. This can be seen as a simple but computationally expensive way of reducing the noise of the gradient estimates. In this paper, we investigate the effect of the noise in this context and show that it can prevent the convergence of standard stochastic game optimization methods, while their respective batch versions converge. To address this issue, we propose a variance-reduced version of the stochastic extragradient algorithm (SVRE). We show experimentally that it performs similarly to a batch method while being computationally cheaper, and we establish its theoretical convergence, improving upon the best rates proposed in the literature. Experiments on several datasets show that SVRE improves over baselines. Notably, SVRE is, to our knowledge, the first optimization method for GANs that can produce near state-of-the-art results without using an adaptive step-size method such as Adam.
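The extragradient step at the heart of the method can be illustrated on a toy bilinear game (an illustrative sketch of plain extragradient only, not the paper's variance-reduced SVRE):

```python
import numpy as np

# Bilinear game min_x max_y f(x, y) = x * y, with game vector field
# v(x, y) = (df/dx, -df/dy) = (y, -x).
def v(z):
    return np.array([z[1], -z[0]])

eta = 0.3
z_gd = np.array([1.0, 1.0])  # simultaneous gradient descent iterate
z_eg = np.array([1.0, 1.0])  # extragradient iterate
for _ in range(200):
    # Plain simultaneous gradient steps spiral away from the equilibrium.
    z_gd = z_gd - eta * v(z_gd)
    # Extragradient: take a lookahead (extrapolation) step, then update
    # from the original point using the gradient at the lookahead point.
    z_half = z_eg - eta * v(z_eg)
    z_eg = z_eg - eta * v(z_half)
```

The lookahead gradient anticipates the rotation of the vector field, which is why extragradient converges on games where simultaneous gradient descent diverges.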
Centroid Networks for Few-Shot Clustering and Unsupervised Few-Shot Classification
Huang, Gabriel, Larochelle, Hugo, Lacoste-Julien, Simon
Traditional clustering algorithms such as K-means rely heavily on the nature of the chosen metric or data representation. To get meaningful clusters, these representations need to be tailored to the downstream task (e.g. cluster photos by object category, cluster faces by identity). Therefore, we frame clustering as a meta-learning task, few-shot clustering, which allows us to specify how to cluster the data at the meta-training level, despite the clustering algorithm itself being unsupervised. We propose Centroid Networks, a simple and efficient few-shot clustering method based on learning representations that are tailored both to the task at hand and to the method's internal clustering module. We also introduce unsupervised few-shot classification, which is conceptually similar to few-shot clustering but is strictly harder than *supervised* few-shot classification, and therefore allows direct comparison with existing supervised few-shot classification methods. On Omniglot and miniImageNet, our method achieves accuracy competitive with popular supervised few-shot classification algorithms, despite using *no labels* from the support set. We also show performance competitive with state-of-the-art learning-to-cluster methods.
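Comparing an unsupervised clustering against supervised accuracy requires scoring cluster assignments whose indices are arbitrary; a standard way to do this (sketched here as an illustrative evaluation, not necessarily the paper's exact protocol) is to compute accuracy under the best one-to-one matching between clusters and classes:

```python
import numpy as np
from itertools import permutations

# Clustering accuracy via optimal cluster-to-class matching. Exhaustive
# search over permutations is fine for few-shot "ways" (small n_classes);
# the Hungarian algorithm handles larger label spaces.
def clustering_accuracy(y_true, y_pred, n_classes):
    best = 0.0
    for perm in permutations(range(n_classes)):
        remapped = np.array([perm[c] for c in y_pred])
        best = max(best, float(np.mean(remapped == y_true)))
    return best

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])  # perfect clusters, permuted labels
```

Under this metric a perfect clustering scores 1.0 even though its cluster indices disagree with the class labels, which is what makes the numbers directly comparable with supervised few-shot accuracy.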
Predicting Tactical Solutions to Operational Planning Problems under Imperfect Information
Larsen, Eric, Lachapelle, Sébastien, Bengio, Yoshua, Frejinger, Emma, Lacoste-Julien, Simon, Lodi, Andrea
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict tactical solutions to a given operational problem. In this context, the tactical solution is less detailed than the operational one, but it has to be computed in very short time and under imperfect information. The problem is of importance in various applications where tactical and operational planning problems are interrelated and information about the operational problem is revealed over time. This is, for instance, the case in certain capacity planning and demand management systems. We formulate the problem as a two-stage optimal prediction stochastic program whose solution we predict with a supervised machine learning algorithm. The training data set consists of a large number of deterministic (second-stage) problems generated by controlled probabilistic sampling. The labels are computed based on solutions to the deterministic problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application in load planning for rail transportation show that deep learning algorithms produce highly accurate predictions in very short computing time (milliseconds or less). The prediction accuracy is comparable to that of solutions computed by sample average approximation of the stochastic program.
Quantifying Learning Guarantees for Convex but Inconsistent Surrogates
Struminsky, Kirill, Lacoste-Julien, Simon, Osokin, Anton
We study consistency properties of machine learning methods based on minimizing convex surrogates. We extend the recent framework of Osokin et al. (2017) for the quantitative analysis of consistency properties to the case of inconsistent surrogates. Our key technical contribution is a new lower bound on the calibration function for the quadratic surrogate, which is non-trivial (not always zero) in inconsistent cases. The new bound allows us to quantify the level of inconsistency of a setting and shows how learning with inconsistent surrogates can still come with guarantees on sample complexity and optimization difficulty. We apply our theory to two concrete cases: multi-class classification with the tree-structured loss and ranking with the mean average precision loss. The results show the approximation-computation trade-offs caused by inconsistent surrogates and their potential benefits.