Gradient Descent
Semi-supervised Learning based on Distributionally Robust Optimization
We propose a novel method for semi-supervised learning (SSL) based on data-driven distributionally robust optimization (DRO) using optimal transport metrics. Our proposed method enhances generalization error by using the unlabeled data to restrict the support of the worst case distribution in our DRO formulation. We enable the implementation of our DRO formulation by proposing a stochastic gradient descent algorithm which allows to easily implement the training procedure. We demonstrate that our Semi-supervised DRO method is able to improve the generalization error over natural supervised procedures and state-of-the-art SSL estimators. Finally, we include a discussion on the large sample behavior of the optimal uncertainty region in the DRO formulation.
HAMSI: A Parallel Incremental Optimization Algorithm Using Quadratic Approximations for Solving Partially Separable Problems
Kaya, Kamer, รztoprak, Figen, Birbil, ล. ฤฐlker, Cemgil, A. Taylan, ลimลekli, Umut, Kuru, Nurdan, Koptagel, Hazal, รztรผrk, M. Kaan
We propose HAMSI (Hessian Approximated Multiple Subsets Iteration), which is a provably convergent, second order incremental algorithm for solving large-scale partially separable optimization problems. The algorithm is based on a local quadratic approximation, and hence, allows incorporating curvature information to speed-up the convergence. HAMSI is inherently parallel and it scales nicely with the number of processors. Combined with techniques for effectively utilizing modern parallel computer architectures, we illustrate that the proposed method converges more rapidly than a parallel stochastic gradient descent when both methods are used to solve large-scale matrix factorization problems. This performance gain comes only at the expense of using memory that scales linearly with the total size of the optimization variables. We conclude that HAMSI may be considered as a viable alternative in many large scale problems, where first order methods based on variants of stochastic gradient descent are applicable.
Algorithmic stability and hypothesis complexity
Liu, Tongliang, Lugosi, Gรกbor, Neu, Gergely, Tao, Dacheng
We introduce a notion of algorithmic stability of learning algorithms---that we term \emph{argument stability}---that captures stability of the hypothesis output by the learning algorithm in the normed space of functions from which hypotheses are selected. The main result of the paper bounds the generalization error of any learning algorithm in terms of its argument stability. The bounds are based on martingale inequalities in the Banach space to which the hypotheses belong. We apply the general bounds to bound the performance of some learning algorithms based on empirical risk minimization and stochastic gradient descent.
Local Asymptotics for Stochastic Optimization: Optimality, Constraint Identification, and Dual Averaging
We study local complexity measures for stochastic convex optimization problems, providing a local minimax theory analogous to that of H\'{a}jek and Le Cam for classical statistical problems, and giving efficient procedures based on Nesterov's dual averaging that (often) adaptively achieve optimal convergence guarantees. Our results provide function-specific lower bounds and convergence results that make precise a correspondence between statistical difficulty and the geometric notion of tilt-stability from optimization. We show how variants of dual averaging---a stochastic gradient-based procedure---guarantee finite time identification of constraints in optimization problems, while stochastic gradient procedures provably fail. Additionally, we highlight a gap between optimization problems with linear and nonlinear constraints: standard stochastic-gradient-based procedures are suboptimal even for the simplest nonlinear constraints.
Learning ReLUs via Gradient Descent
In this paper we study the problem of learning Rectified Linear Units (ReLUs) which are functions of the form $max(0,
Efficient Regret Minimization in Non-Convex Games
Hazan, Elad, Singh, Karan, Zhang, Cyril
We consider regret minimization in repeated games with non-convex loss functions. Minimizing the standard notion of regret is computationally intractable. Thus, we define a natural notion of regret which permits efficient optimization and generalizes offline guarantees for convergence to an approximate local optimum. We give gradient-based methods that achieve optimal regret, which in turn guarantee convergence to equilibrium in this framework.
A Robust Multi-Batch L-BFGS Method for Machine Learning
Berahas, Albert S., Takรกฤ, Martin
This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.
Stochastic Alternating Direction Method of Multipliers with Variance Reduction for Nonconvex Optimization
Huang, Feihu, Chen, Songcan, Lu, Zhaosong
In the paper, we study the stochastic alternating direction method of multipliers (ADMM) for the nonconvex optimizations, and propose three classes of the nonconvex stochastic ADMM with variance reduction, based on different reduced variance stochastic gradients. Specifically, the first class called the nonconvex stochastic variance reduced gradient ADMM (SVRG-ADMM), uses a multi-stage scheme to progressively reduce the variance of stochastic gradients. The second is the nonconvex stochastic average gradient ADMM (SAG-ADMM), which additionally uses the old gradients estimated in the previous iteration. The third called SAGA-ADMM is an extension of the SAG-ADMM method. Moreover, under some mild conditions, we establish the iteration complexity bound of $O(1/\epsilon)$ of the proposed methods to obtain an $\epsilon$-stationary solution of the nonconvex optimizations. In particular, we provide a general framework to analyze the iteration complexity of these nonconvex stochastic ADMM methods with variance reduction. Finally, some numerical experiments demonstrate the effectiveness of our methods.
Prediction-Constrained Training for Semi-Supervised Mixture and Topic Models
Hughes, Michael C., Weiner, Leah, Hope, Gabriel, McCoy, Thomas H. Jr., Perlis, Roy H., Sudderth, Erik B., Doshi-Velez, Finale
Supervisory signals have the potential to make low-dimensional data representations, like those learned by mixture and topic models, more interpretable and useful. We propose a framework for training latent variable models that explicitly balances two goals: recovery of faithful generative explanations of high-dimensional data, and accurate prediction of associated semantic labels. Existing approaches fail to achieve these goals due to an incomplete treatment of a fundamental asymmetry: the intended application is always predicting labels from data, not data from labels. Our prediction-constrained objective for training generative models coherently integrates loss-based supervisory signals while enabling effective semi-supervised learning from partially labeled data. We derive learning algorithms for semi-supervised mixture and topic models using stochastic gradient descent with automatic differentiation. We demonstrate improved prediction quality compared to several previous supervised topic models, achieving predictions competitive with high-dimensional logistic regression on text sentiment analysis and electronic health records tasks while simultaneously learning interpretable topics.
Towards Decoding as Continuous Optimization in Neural Machine Translation
Hoang, Cong Duy Vu, Haffari, Gholamreza, Cohn, Trevor
We propose a novel decoding approach for neural machine translation (NMT) based on continuous optimisation. We convert decoding - basically a discrete optimization problem - into a continuous optimization problem. The resulting constrained continuous optimisation problem is then tackled using gradient-based methods. Our powerful decoding framework enables decoding intractable models such as the intersection of left-to-right and right-to-left (bidirectional) as well as source-to-target and target-to-source (bilingual) NMT models. Our empirical results show that our decoding framework is effective, and leads to substantial improvements in translations generated from the intersected models where the typical greedy or beam search is not feasible. We also compare our framework against reranking, and analyse its advantages and disadvantages.