Optimization
End to end learning and optimization on graphs
Real-world applications often combine learning and optimization problems on graphs. For instance, our objective may be to cluster the graph in order to detect meaningful communities (or solve other common graph optimization problems such as facility location, maxcut, and so on). However, graphs or related attributes are often only partially observed, introducing learning problems such as link prediction which must be solved prior to optimization. We propose an approach to integrate a differentiable proxy for common graph optimization problems into training of machine learning models for tasks such as link prediction. This allows the model to focus specifically on the downstream task that its predictions will be used for.
Learning to Clear the Market
Shen, Weiran, Lahaie, Sรฉbastien, Leme, Renato Paes
The problem of market clearing is to set a price for an item such that quantity demanded equals quantity supplied. In this work, we cast the problem of predicting clearing prices into a learning framework and use the resulting models to perform revenue optimization in auctions and markets with contextual information. The economic intuition behind market clearing allows us to obtain fine-grained control over the aggressiveness of the resulting pricing policy, grounded in theory. To evaluate our approach, we fit a model of clearing prices over a massive dataset of bids in display ad auctions from a major ad exchange. The learned prices outperform other modeling techniques in the literature in terms of revenue and efficiency trade-offs. Because of the convex nature of the clearing loss function, the convergence rate of our method is as fast as linear regression.
Towards Unified Acceleration of High-Order Algorithms under H\"{o}lder Continuity and Uniform Convexity
In this paper, through a very intuitive {\em vanilla proximal method} perspective, we derive accelerated high-order optimization algorithms for minimizing a convex function that has H\"{o}lder continuous derivatives. In this general convex setting, we propose a {\em unified acceleration algorithm} with an iteration complexity that matches the lower iteration complexity bound given in \cite{grapiglia2019tensor}. If the function is further uniformly convex, we propose a {\em general restart scheme}. The iteration complexity of the algorithm matches existing lower bounds in most important cases. For practical implementation, we introduce a new and effective heuristic that significantly simplifies the binary search procedure required by the algorithm, which makes the algorithm in general settings as efficient as the special case \cite{grapiglia2019tensor}. On large-scale classification datasets, our algorithm demonstrates clear and consistent advantages of high-order acceleration methods over first-order ones, in terms of run-time complexity. Our formulation considers the more general composite setting in which the objective function may contain a second possibly non-smooth convex term. Our analysis and proofs are also applicable to the general case in which the high-order smoothness conditions are with respect to non-Euclidean norms.
Understanding Distributional Ambiguity via Non-robust Chance Constraint
Wu, Qi, Ma, Shumin, Leung, Cheuk Hang, Liu, Wei
The choice of the ambiguity radius is critical when an investor uses the distributionally robust approach to address the issue that the portfolio optimization problem is sensitive to the uncertainties of the asset return distribution. It cannot be set too large because the larger the size of the ambiguity set the worse the portfolio return. It cannot be too small either; otherwise, one loses the robust protection. This tradeoff demands a financial understanding of the ambiguity set. In this paper, we propose a non-robust interpretation of the distributionally robust optimization (DRO) problem. By relating the impact of an ambiguity set to the impact of a non-robust chance constraint, our interpretation allows investors to understand the size of the ambiguity set through parameters that are directly linked to investment performance. We first show that for general $\phi$-divergences, a DRO problem is asymptotically equivalent to a class of mean-deviation problem, where the ambiguity radius controls investor's risk preference. Based on this non-robust reformulation, we then show that when a boundedness constraint is added to the investment strategy, the DRO problem can be cast as a chance-constrained optimization (CCO) problem without distributional uncertainties. If the boundedness constraint is removed, the CCO problem is shown to perform uniformly better than the DRO problem, irrespective of the radius of the ambiguity set, the choice of the divergence measure, or the tail heaviness of the center distribution. Our results apply to both the widely-used Kullback-Leibler (KL) divergence which requires the distribution of the objective function to be exponentially bounded, as well as those more general divergence measures which allow heavy tail ones such as student $t$ and lognormal distributions.
A Generic Acceleration Framework for Stochastic Composite Optimization
Kulunchakov, Andrei, Mairal, Julien
In this paper, we introduce various mechanisms to obtain accelerated first-order stochastic optimization algorithms when the objective function is convex or strongly convex. Specifically, we extend the Catalyst approach originally designed for deterministic objectives to the stochastic setting. Given an optimization method with mild convergence guarantees for strongly convex problems, the challenge is to accelerate convergence to a noise-dominated region, and then achieve convergence with an optimal worst-case complexity depending on the noise variance of the gradients. A side contribution of our work is also a generic analysis that can handle inexact proximal operators, providing new insights about the robustness of stochastic algorithms when the proximal operator cannot be exactly computed.
A Direct $\tilde{O}(1/\epsilon)$ Iteration Parallel Algorithm for Optimal Transport
Jambulapati, Arun, Sidford, Aaron, Tian, Kevin
Optimal transportation, or computing the Wasserstein or ``earth mover's'' distance between two distributions, is a fundamental primitive which arises in many learning and statistical settings. We give an algorithm which solves this problem to additive $\epsilon$ with $\tilde{O}(1/\epsilon)$ parallel depth, and $\tilde{O}\left(n^2/\epsilon\right)$ work. Barring a breakthrough on a long-standing algorithmic open problem, this is optimal for first-order methods. Blanchet et. al. '18, Quanrud '19 obtained similar runtimes through reductions to positive linear programming and matrix scaling. However, these reduction-based algorithms use complicated subroutines which may be deemed impractical due to requiring solvers for second-order iterations (matrix scaling) or non-parallelizability (positive LP). The fastest practical algorithms run in time $\tilde{O}(\min(n^2 / \epsilon^2, n^{2.5} / \epsilon))$ (Dvurechensky et. al. '18, Lin et. al. '19). We bridge this gap by providing a parallel, first-order, $\tilde{O}(1/\epsilon)$ iteration algorithm without worse dependence on dimension, and provide preliminary experimental evidence that our algorithm may enjoy improved practical performance. We obtain this runtime via a primal-dual extragradient method, motivated by recent theoretical improvements to maximum flow (Sherman '17).
Clustering by Orthogonal NMF Model and Non-Convex Penalty Optimization
Wang, Shuai, Chang, Tsung-Hui, Cui, Ying, Pang, Jong-Shi
The non-negative matrix factorization (NMF) model with an additional orthogonality constraint on one of the factor matrices, called the orthogonal NMF (ONMF), has been found to provide improved clustering performance over the K-means. Solving the ONMF model is a challenging optimization problem due to the existence of both orthogonality and nonnegativity constraints, and most of the existing methods directly deal with the orthogonality constraint in its original form via various optimization techniques. In this paper, we propose a new ONMF based clustering formulation that equivalently transforms the orthogonality constraint into a set of norm-based non-convex equality constraints. We then apply a non-convex penalty (NCP) approach to add the non-convex equality constraints to the objective as penalty terms, leaving simple non-negativity constraints only in the penalized problem. One smooth penalty formulation and one non-smooth penalty formulation are respectively studied, and theoretical conditions for the penalized problems to provide feasible stationary solutions to the ONMF based clustering problem are presented. Experimental results based on both synthetic and real datasets are presented to show that the proposed NCP methods are computationally time efficient, and either match or outperform the existing K-means and ONMF based methods in terms of the clustering performance.
Parallel sequential Monte Carlo for stochastic gradient-free nonconvex optimization
Akyildiz, รmer Deniz, Crisan, Dan, Mรญguez, Joaquรญn
We introduce and analyze a parallel sequential Monte Carlo methodology for the numerical solution of optimization problems that involve the minimization of a cost function that consists of the sum of many individual components. The proposed scheme is a stochastic zeroth order optimization algorithm which demands only the capability to evaluate small subsets of components of the cost function. It can be depicted as a bank of samplers that generate particle approximations of several sequences of probability measures. These measures are constructed in such a way that they have associated probability density functions whose global maxima coincide with the global minima of the original cost function. The algorithm selects the best performing sampler and uses it to approximate a global minimum of the cost function. We prove analytically that the resulting estimator converges to a global minimum of the cost function almost surely and provide explicit convergence rates in terms of the number of generated Monte Carlo samples. We show, by way of numerical examples, that the algorithm can tackle cost functions with multiple minima or with broad "flat" regions which are hard to minimize using gradient-based techniques.
Truncated Cauchy Non-negative Matrix Factorization
Guan, Naiyang, Liu, Tongliang, Zhang, Yangmuzi, Tao, Dacheng, Davis, Larry S.
Non-negative matrix factorization (NMF) minimizes the Euclidean distance between the data matrix and its low rank approximation, and it fails when applied to corrupted data because the loss function is sensitive to outliers. In this paper, we propose a Truncated CauchyNMF loss that handle outliers by truncating large errors, and develop a Truncated CauchyNMF to robustly learn the subspace on noisy datasets contaminated by outliers. We theoretically analyze the robustness of Truncated CauchyNMF comparing with the competing models and theoretically prove that Truncated CauchyNMF has a generalization bound which converges at a rate of order $O(\sqrt{{\ln n}/{n}})$, where $n$ is the sample size. We evaluate Truncated CauchyNMF by image clustering on both simulated and real datasets. The experimental results on the datasets containing gross corruptions validate the effectiveness and robustness of Truncated CauchyNMF for learning robust subspaces.
Generalized Momentum-Based Methods: A Hamiltonian Perspective
Diakonikolas, Jelena, Jordan, Michael I.
We take a Hamiltonian-based perspective to generalize Nesterov's accelerated gradient descent and Polyak's heavy ball method to a broad class of momentum methods in the setting of (possibly) constrained minimization in Banach spaces. Our perspective leads to a generic and unifying non-asymptotic analysis of convergence of these methods in both the function value (in the setting of convex optimization) and in the norm of the gradient (in the setting of unconstrained, possibly nonconvex, optimization). The convergence analysis is intuitive and based on the conserved quantities of the time-dependent Hamiltonian that we introduce and that produces generalized momentum methods as its equations of motion.