Goto

Collaborating Authors

 Optimization


From optimal transport to generative modeling: the VEGAN cookbook

arXiv.org Machine Learning

We study unsupervised generative modeling in terms of the optimal transport (OT) problem between true (but unknown) data distribution $P_X$ and the latent variable model distribution $P_G$. We show that the OT problem can be equivalently written in terms of probabilistic encoders, which are constrained to match the posterior and prior distributions over the latent space. When relaxed, this constrained optimization problem leads to a penalized optimal transport (POT) objective, which can be efficiently minimized using stochastic gradient descent by sampling from $P_X$ and $P_G$. We show that POT for the 2-Wasserstein distance coincides with the objective heuristically employed in adversarial auto-encoders (AAE) (Makhzani et al., 2016), which provides the first theoretical justification for AAEs known to the authors. We also compare POT to other popular techniques like variational auto-encoders (VAE) (Kingma and Welling, 2014). Our theoretical results include (a) a better understanding of the commonly observed blurriness of images generated by VAEs, and (b) establishing duality between Wasserstein GAN (Arjovsky and Bottou, 2017) and POT for the 1-Wasserstein distance.


Accelerated Stochastic Quasi-Newton Optimization on Riemann Manifolds

arXiv.org Machine Learning

Optimization algorithms are a mainstay in machine learning research, underpinning solvers for a wide swath of problems ranging from linear regression and SVMs to deep learning. Consequently, scaling such algorithms to large scale datasets while preserving theoretical guarantees is of paramount importance. An important challenge in this field is designing scalable algorithms for optimization problems in the presence of constraints on the search space, a situation all too often encountered in real life. One approach to handling such constrained optimization problems on vector spaces is to reformulate them as optimization tasks on a suitable Riemannian manifold, with the constraints acting as manifold parametrization. Often, the problems can be shown to possess desirable geometric properties like convexity with respect to distance-minimizing geodesics on the manifold, leading to provably efficient optimization algorithms [1, 2, 3, 4].


Large Scale Empirical Risk Minimization via Truncated Adaptive Newton Method

arXiv.org Machine Learning

We consider large scale empirical risk minimization (ERM) problems, where both the problem dimension and variable size is large. In these cases, most second order methods are infeasible due to the high cost in both computing the Hessian over all samples and computing its inverse in high dimensions. In this paper, we propose a novel adaptive sample size second-order method, which reduces the cost of computing the Hessian by solving a sequence of ERM problems corresponding to a subset of samples and lowers the cost of computing the Hessian inverse using a truncated eigenvalue decomposition. We show that while we geometrically increase the size of the training set at each stage, a single iteration of the truncated Newton method is sufficient to solve the new ERM within its statistical accuracy. Moreover, for a large number of samples we are allowed to double the size of the training set at each stage, and the proposed method subsequently reaches the statistical accuracy of the full training set approximately after two effective passes. In addition to this theoretical result, we show empirically on a number of well known data sets that the proposed truncated adaptive sample size algorithm outperforms stochastic alternatives for solving ERM problems.


The Physical Systems Behind Optimization Algorithms

arXiv.org Machine Learning

We use differential equations based approaches to provide some {\it \textbf{physics}} insights into analyzing the dynamics of popular optimization algorithms in machine learning. In particular, we study gradient descent, proximal gradient descent, coordinate gradient descent, proximal coordinate gradient, and Newton's methods as well as their Nesterov's accelerated variants in a unified framework motivated by a natural connection of optimization algorithms to physical systems. Our analysis is applicable to more general algorithms and optimization problems {\it \textbf{beyond}} convexity and strong convexity, e.g. Polyak-\L ojasiewicz and error bound conditions (possibly nonconvex).


A Unified Framework for Stochastic Matrix Factorization via Variance Reduction

arXiv.org Machine Learning

We propose a unified framework to speed up the existing stochastic matrix factorization (SMF) algorithms via variance reduction. Our framework is general and it subsumes several well-known SMF formulations in the literature. We perform a non-asymptotic convergence analysis of our framework and derive computational and sample complexities for our algorithm to converge to an $\epsilon$-stationary point in expectation. In addition, extensive experiments for a wide class of SMF formulations demonstrate that our framework consistently yields faster convergence and a more accurate output dictionary vis-\`a-vis state-of-the-art frameworks.


An Investigation of Newton-Sketch and Subsampled Newton Methods

arXiv.org Machine Learning

The concepts of sketching and subsampling have recently received much attention by the optimization and statistics communities. In this paper, we study Newton-Sketch and Subsampled Newton (SSN) methods for the finite-sum optimization problem. We consider practical versions of the two methods in which the Newton equations are solved approximately using the conjugate gradient (CG) method or a stochastic gradient iteration. We establish new complexity results for the SSN-CG method that exploit the spectral properties of CG. Controlled numerical experiments compare the relative strengths of Newton-Sketch and SSN methods and show that for many finite-sum problems, they are far more efficient than SVRG, a popular first-order method.


Doubly Robust Data-Driven Distributionally Robust Optimization

arXiv.org Machine Learning

Data-driven Distributionally Robust Optimization (DD-DRO) via optimal transport has been shown to encompass a wide range of popular machine learning algorithms. The distributional uncertainty size is often shown to correspond to the regularization parameter. The type of regularization (e.g. the norm used to regularize) corresponds to the shape of the distributional uncertainty. We propose a data-driven robust optimization methodology to inform the transportation cost underlying the definition of the distributional uncertainty. We show empirically that this additional layer of robustification, which produces a method we called doubly robust data-driven distributionally robust optimization (DD-R-DRO), allows to enhance the generalization properties of regularized estimators while reducing testing error relative to state-of-the-art classifiers in a wide range of data sets.


Utilitarian Approach to Privacy in Distributed Constraint Optimization Problems

AAAI Conferences

Privacy has been a major motivation for distributed problem optimization. However, even though several methods have been proposed to evaluate it, none of them is widely used. The Distributed Constraint Optimization Problem (DCOP) is a fundamental model used to approach various families of distributed problems. Here we approach the problem by letting both the optimized costs found in DCOPs and the privacy requirements guide the agents' exploration of the search space. We introduce Utilitarian Distributed Constraint Optimization Problem (UDCOP) where the costs and the privacy requirements are used as parameters to a heuristic modifying the search process. Common stochastic algorithms for decentralized constraint optimization problems are evaluated here according to how well they preserve privacy.


Online Article Ranking as a Constrained, Dynamic, Multi-Objective Optimization Problem

AAAI Conferences

The content ranking problem in a social news website is typically a function that maximizes a scalar metric like dwell-time. However, in most real-world applications we are interested in more than one metric — for instance, simultaneously maximizing click-through rate, monetization metrics, and dwell-time — while also satisfying the constraints from traffic requirements promised to different publishers. The solution needs to be an online algorithm since the data arrives serially. Additionally, the objective function and the constraints can dynamically change. In this paper, we formulate this problem as a constrained, dynamic, multi-objective optimization problem. We propose a novel framework that extends a successful genetic optimization algorithm, NSGA-II, to solve our ranking problem. We evaluate optimization performance using the Hypervolume metric. We demonstrate the application of our framework on a real-world article ranking problem from the Yahoo! News page. We observe that our proposed solution makes considerable improvements in both time and performance over a brute-force baseline technique that is currently in production.


A sequential Monte Carlo approach to Thompson sampling for Bayesian optimization

arXiv.org Machine Learning

Bayesian optimization through Gaussian process regression is an effective method of optimizing an unknown function for which every measurement is expensive. It approximates the objective function and then recommends a new measurement point to try out. This recommendation is usually selected by optimizing a given acquisition function. After a sufficient number of measurements, a recommendation about the maximum is made. However, a key realization is that the maximum of a Gaussian process is not a deterministic point, but a random variable with a distribution of its own. This distribution cannot be calculated analytically. Our main contribution is an algorithm, inspired by sequential Monte Carlo samplers, that approximates this maximum distribution. Subsequently, by taking samples from this distribution, we enable Thompson sampling to be applied to (armed-bandit) optimization problems with a continuous input space. All this is done without requiring the optimization of a nonlinear acquisition function. Experiments have shown that the resulting optimization method has a competitive performance at keeping the cumulative regret limited.