Optimization
Smoothed analysis of the low-rank approach for smooth semidefinite programs
Pumir, Thomas, Jelassi, Samy, Boumal, Nicolas
We consider semidefinite programs (SDPs) of size $n$ with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix $Y$ of size $n\times k$ such that $X=YY^*$ is the SDP variable. The advantages of such formulation are twofold: the dimension of the optimization variable is reduced, and positive semidefiniteness is naturally enforced. However, optimization in $Y$ is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided $k$ is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To this end, and under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with $k$ scaling like the square root of the number of constraints (up to log factors). We particularize our results to an SDP relaxation of phase retrieval.
Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters
Dvurechenskii, Pavel, Dvinskikh, Darina, Gasnikov, Alexander, Uribe, Cesar, Nedich, Angelia
We study the decentralized distributed computation of discrete approximations for the regularized Wasserstein barycenter of a finite set of continuous probability measures distributedly stored over a network. We assume there is a network of agents/machines/computers, and each agent holds a private continuous probability measure and seeks to compute the barycenter of all the measures in the network by getting samples from its local measure and exchanging information with its neighbors. Motivated by this problem, we develop, and analyze, a novel accelerated primal-dual stochastic gradient method for general stochastic convex optimization problems with linear equality constraints. Then, we apply this method to the decen- tralized distributed optimization setting to obtain a new algorithm for the distributed semi-discrete regularized Wasserstein barycenter problem. Moreover, we show explicit non-asymptotic complexity for the proposed algorithm. Finally, we show the effectiveness of our method on the distributed computation of the regularized Wasserstein barycenter of univariate Gaussian and von Mises distributions, as well as some applications to image aggregation.
Limited Memory Kelley's Method Converges for Composite Convex and Submodular Objectives
Zhou, Song, Gupta, Swati, Udell, Madeleine
The original simplicial method (OSM), a variant of the classic Kelleyโs cutting plane method, has been shown to converge to the minimizer of a composite convex and submodular objective, though no rate of convergence for this method was known. Moreover, OSM is required to solve subproblems in each iteration whose size grows linearly in the number of iterations. We propose a limited memory version of Kelleyโs method (L-KM) and of OSM that requires limited memory (at most n+ 1 constraints for an n-dimensional problem) independent of the iteration. We prove convergence for L-KM when the convex part of the objective g is strongly convex and show it converges linearly when g is also smooth. Our analysis relies on duality between minimization of the composite convex and submodular objective and minimization of a convex function over the submodular base polytope. We introduce a limited memory version, L-FCFW, of the Fully-Corrective Frank-Wolfe (FCFW) method with approximate correction, to solve the dual problem. We show that L-FCFW and L-KM are dual algorithms that produce the same sequence of iterates; hence both converge linearly (when g is smooth and strongly convex) and with limited memory. We propose L-KM to minimize composite convex and submodular objectives; however, our results on L-FCFW hold for general polytopes and may be of independent interest.
Provably Correct Automatic Sub-Differentiation for Qualified Programs
Kakade, Sham M., Lee, Jason D.
The \emph{Cheap Gradient Principle}~\citep{Griewank:2008:EDP:1455489} --- the computational cost of computing a $d$-dimensional vector of partial derivatives of a scalar function is nearly the same (often within a factor of $5$) as that of simply computing the scalar function itself --- is of central importance in optimization; it allows us to quickly obtain (high-dimensional) gradients of scalar loss functions which are subsequently used in black box gradient-based optimization procedures. The current state of affairs is markedly different with regards to computing sub-derivatives: widely used ML libraries, including TensorFlow and PyTorch, do \emph{not} correctly compute (generalized) sub-derivatives even on simple differentiable examples. This work considers the question: is there a \emph{Cheap Sub-gradient Principle}? Our main result shows that, under certain restrictions on our library of non-smooth functions (standard in non-linear programming), provably correct generalized sub-derivatives can be computed at a computational cost that is within a (dimension-free) factor of $6$ of the cost of computing the scalar function itself.
Decentralize and Randomize: Faster Algorithm for Wasserstein Barycenters
Dvurechenskii, Pavel, Dvinskikh, Darina, Gasnikov, Alexander, Uribe, Cesar, Nedich, Angelia
We study the decentralized distributed computation of discrete approximations for the regularized Wasserstein barycenter of a finite set of continuous probability measures distributedly stored over a network. We assume there is a network of agents/machines/computers, and each agent holds a private continuous probability measure and seeks to compute the barycenter of all the measures in the network by getting samples from its local measure and exchanging information with its neighbors. Motivated by this problem, we develop, and analyze, a novel accelerated primal-dual stochastic gradient method for general stochastic convex optimization problems with linear equality constraints. Then, we apply this method to the decen- tralized distributed optimization setting to obtain a new algorithm for the distributed semi-discrete regularized Wasserstein barycenter problem. Moreover, we show explicit non-asymptotic complexity for the proposed algorithm. Finally, we show the effectiveness of our method on the distributed computation of the regularized Wasserstein barycenter of univariate Gaussian and von Mises distributions, as well as some applications to image aggregation.
Multi-objective Maximization of Monotone Submodular Functions with Cardinality Constraint
We consider the problem of multi-objective maximization of monotone submodular functions subject to cardinality constraint, often formulated as $\max_{|A|=k}\min_{i\in\{1,\dots,m\}}f_i(A)$. While it is widely known that greedy methods work well for a single objective, the problem becomes much harder with multiple objectives. In fact, Krause et al.\ (2008) showed that when the number of objectives $m$ grows as the cardinality $k$ i.e., $m=\Omega(k)$, the problem is inapproximable (unless $P=NP$). On the other hand, when $m$ is constant Chekuri et al.\ (2010) showed a randomized $(1-1/e)-\epsilon$ approximation with runtime (number of queries to function oracle) $n^{m/\epsilon^3}$. %In fact, the result of Chekuri et al.\ (2010) is for the far more general case of matroid constant. We focus on finding a fast and practical algorithm that has (asymptotic) approximation guarantees even when $m$ is super constant. We first modify the algorithm of Chekuri et al.\ (2010) to achieve a $(1-1/e)$ approximation for $m=o(\frac{k}{\log^3 k})$. This demonstrates a steep transition from constant factor approximability to inapproximability around $m=\Omega(k)$. Then using Multiplicative-Weight-Updates (MWU), we find a much faster $\tilde{O}(n/\delta^3)$ time asymptotic $(1-1/e)^2-\delta$ approximation. While the above results are all randomized, we also give a simple deterministic $(1-1/e)-\epsilon$ approximation with runtime $kn^{m/\epsilon^4}$. Finally, we run synthetic experiments using Kronecker graphs and find that our MWU inspired heuristic outperforms existing heuristics.
Optimization of Smooth Functions with Noisy Observations: Local Minimax Rates
Wang, Yining, Balakrishnan, Sivaraman, Singh, Aarti
We consider the problem of global optimization of an unknown non-convex smooth function with noisy zeroth-order feedback. We propose a local minimax framework to study the fundamental difficulty of optimizing smooth functions with adaptive function evaluations. We show that for functions with fast growth around their global minima, carefully designed optimization algorithms can identify a near global minimizer with many fewer queries than worst-case global minimax theory predicts. For the special case of strongly convex and smooth functions, our implied convergence rates match the ones developed for zeroth-order convex optimization problems. On the other hand, we show that in the worst case no algorithm can converge faster than the minimax rate of estimating an unknown functions in linf-norm. Finally, we show that non-adaptive algorithms, although optimal in a global minimax sense, do not attain the optimal local minimax rate.
A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization
Josz, Cedric, Ouyang, Yi, Zhang, Richard, Lavaei, Javad, Sojoudi, Somayeh
We study the set of continuous functions that admit no spurious local optima (i.e. local minima that are not global minima) which we term global functions. They satisfy various powerful properties for analyzing nonconvex and nonsmooth optimization problems. For instance, they satisfy a theorem akin to the fundamental uniform limit theorem in the analysis regarding continuous functions. Global functions are also endowed with useful properties regarding the composition of functions and change of variables. Using these new results, we show that a class of non-differentiable nonconvex optimization problems arising in tensor decomposition applications are global functions. This is the first result concerning nonconvex methods for nonsmooth objective functions. Our result provides a theoretical guarantee for the widely-used $\ell_1$ norm to avoid outliers in nonconvex optimization.
Submodular Maximization via Gradient Ascent: The Case of Deep Submodular Functions
Bai, Wenruo, Noble, William Stafford, Bilmes, Jeff A.
We study the problem of maximizing deep submodular functions (DSFs) subject to a matroid constraint. DSFs are an expressive class of submodular functions that include, as strict subfamilies, the facility location, weighted coverage, and sums of concave composed with modular functions. We use a strategy similar to the continuous greedy approach, but we show that the multilinear extension of any DSF has a natural and computationally attainable concave relaxation that we can optimize using gradient ascent. Our results show a guarantee of $\max_{0<\delta<1}(1-\epsilon-\delta-e^{-\delta^2\Omega(k)})$ with a running time of $O(\nicefrac{n^2}{\epsilon^2})$ plus time for pipage rounding to recover a discrete solution, where $k$ is the rank of the matroid constraint. This bound is often better than the standard $1-1/e$ guarantee of the continuous greedy algorithm, but runs much faster. Our bound also holds even for fully curved ($c=1$) functions where the guarantee of $1-c/e$ degenerates to $1-1/e$ where $c$ is the curvature of $f$. We perform computational experiments that support our theoretical results.
Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
Yao, Zhewei, Gholami, Amir, Lei, Qi, Keutzer, Kurt, Mahoney, Michael W.
Large batch size training of Neural Networks has been shown to incur accuracy loss when trained with the current methods. The exact underlying reasons for this are still not completely understood. Here, we study large batch size training through the lens of the Hessian operator and robust optimization. In particular, we perform a Hessian based study to analyze exactly how the landscape of the loss function changes when training with large batch size. We compute the true Hessian spectrum, without approximation, by back-propagating the second derivative. Extensive experiments on multiple networks show that saddle-points are not the cause for generalization gap of large batch size training, and the results consistently show that large batch converges to points with noticeably higher Hessian spectrum. Furthermore, we show that robust training allows one to favor flat areas, as points with large Hessian spectrum show poor robustness to adversarial perturbation. We further study this relationship, and provide empirical and theoretical proof that the inner loop for robust training is a saddle-free optimization problem \textit{almost everywhere}. We present detailed experiments with five different network architectures, including a residual network, tested on MNIST, CIFAR-10/100 datasets.