Optimization
A Polynomial-Time Deterministic Approach to the Traveling Salesperson Problem
We propose a new polynomial-time deterministic algorithm that produces an approximate solution for the traveling salesperson problem. The proposed algorithm ranks cities based on their priorities, calculated using a power function of the means and standard deviations of their distances from other cities, and then connects the cities to their neighbors in the order of their priorities. When connecting a city, a neighbor is selected according to the candidate neighbors' priorities, calculated using another power function that additionally includes their distance from the focal city to be connected. This repeats until all the cities are connected into a single loop. The time complexity of the proposed algorithm is $O(n^2)$, where $n$ is the number of cities. Numerical evaluation shows that, despite its simplicity, the proposed algorithm produces shorter tours with lower time complexity than other conventional tour construction heuristics. The proposed algorithm can be used by itself or as an initial tour generator for other more complex heuristic optimization algorithms.
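The construction described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `priority_tour`, the exponents `a`, `b`, `c`, and the exact form of both power functions are assumptions made here for illustration, since the abstract does not specify them. Path fragments are tracked with a union-find structure so that the loop is only closed once all cities are joined.

```python
import math
from statistics import mean, pstdev


def priority_tour(points, a=1.0, b=1.0, c=1.0):
    """Greedy tour construction; a, b, c are hypothetical exponents."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three cities")
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Assumed city priority: mean**a * std**b of distances to other cities.
    prio = []
    for i in range(n):
        others = [dist[i][j] for j in range(n) if j != i]
        prio.append(mean(others) ** a * pstdev(others) ** b)
    order = sorted(range(n), key=lambda i: -prio[i])

    parent = list(range(n))  # union-find over open path fragments

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    neighbors = [[] for _ in range(n)]
    fragments = n
    for city in order:
        while len(neighbors[city]) < 2:
            if fragments == 1:
                # A single open path remains: join its two end cities.
                cands = [j for j in range(n)
                         if j != city and len(neighbors[j]) < 2]
            else:
                # Only link to another fragment, so no early sub-loop forms.
                cands = [j for j in range(n)
                         if len(neighbors[j]) < 2 and find(j) != find(city)]
            # Assumed neighbor score: priority divided by distance**c.
            best = max(cands,
                       key=lambda j: prio[j] / max(dist[city][j], 1e-12) ** c)
            neighbors[city].append(best)
            neighbors[best].append(city)
            if fragments > 1:
                parent[find(city)] = find(best)
                fragments -= 1

    # Walk the loop starting from city 0 to list the visiting order.
    tour, prev, cur = [0], None, 0
    while len(tour) < n:
        first, second = neighbors[cur]
        nxt = first if first != prev else second
        tour.append(nxt)
        prev, cur = cur, nxt
    return tour
```

Each connection step scans at most $n$ candidates and exactly $n$ connections are made, so this sketch stays within roughly quadratic time (the union-find lookups add a small overhead), in line with the complexity stated above.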
Outlier-robust moment-estimation via sum-of-squares
Kothari, Pravesh K., Steurer, David
We develop efficient algorithms for estimating low-degree moments of unknown distributions in the presence of adversarial outliers. The guarantees of our algorithms improve, in many cases significantly, over the best previous ones, obtained in recent works of Diakonikolas et al., Lai et al., and Charikar et al. We also show that the guarantees of our algorithms match information-theoretic lower bounds for the class of distributions we consider. These improved guarantees allow us to give improved algorithms for independent component analysis and learning mixtures of Gaussians in the presence of outliers. Our algorithms are based on a standard sum-of-squares relaxation of the following conceptually simple optimization problem: Among all distributions whose moments are bounded in the same way as for the unknown distribution, find the one that is closest in statistical distance to the empirical distribution of the adversarially corrupted sample.
Truncated Variational Expectation Maximization
We derive a novel variational expectation maximization approach based on truncated variational distributions. Truncated distributions are proportional to exact posteriors within a subset of a discrete state space and equal zero otherwise. The novel variational approach is realized by first generalizing the standard variational EM framework to include variational distributions with exact (`hard') zeros. A fully variational treatment of truncated distributions then allows for deriving novel and mathematically grounded results, which in turn can be used to formulate novel efficient algorithms to optimize the parameters of probabilistic generative models. We find the free energies which correspond to truncated distributions to be given by concise and efficiently computable expressions, while update equations for model parameters (M-steps) remain in their standard form. Furthermore, we obtain generic expressions for expectation values w.r.t. truncated distributions. Based on these observations, we show how efficient and easily applicable meta-algorithms can be formulated that guarantee a monotonic increase of the free energy. Example applications of the here derived framework provide novel theoretical results and learning procedures for latent variable models as well as mixture models including procedures to tightly couple sampling and variational optimization approaches. Furthermore, by considering a special case of truncated variational distributions, we can cleanly and fully embed the well-known `hard EM' approaches into the variational EM framework, and we show that `hard EM' (for models with discrete latents) provably optimizes a lower free energy bound of the data log-likelihood.
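The definition of a truncated distribution given in this abstract (proportional to the exact posterior inside a subset of states, zero elsewhere) can be made concrete with a small Python sketch. The function name `truncated_posterior` and the dictionary-based representation of the log joint $\log p(s, y)$ are assumptions for illustration, not part of the paper.

```python
import math


def truncated_posterior(log_joint, subset):
    """Return (q, log_z) for a truncated variational distribution.

    log_joint maps each discrete latent state s to log p(s, y);
    subset is the non-empty set of states on which q may be non-zero.
    """
    # Log-sum-exp over the subset only, for numerical stability.
    m = max(log_joint[s] for s in subset)
    log_z = m + math.log(sum(math.exp(log_joint[s] - m) for s in subset))
    # q(s) = p(s, y) / sum_{s' in subset} p(s', y) inside, hard zero outside.
    q = {
        s: math.exp(log_joint[s] - log_z) if s in subset else 0.0
        for s in log_joint
    }
    return q, log_z


log_joint = {"a": math.log(0.2), "b": math.log(0.1), "c": math.log(0.1)}
q, log_z = truncated_posterior(log_joint, {"a", "b"})
```

Because the sum runs over a subset of states, `log_z` can never exceed the full log-evidence $\log p(y)$. When the subset holds a single state, `q` puts all mass on that state, which is the kind of hard assignment used in `hard EM'.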
Adaptive Stochastic Dual Coordinate Ascent for Conditional Random Fields
Priol, Rémi Le, Touati, Ahmed, Lacoste-Julien, Simon
This work investigates training Conditional Random Fields (CRF) by Stochastic Dual Coordinate Ascent (SDCA). SDCA enjoys a linear convergence rate and a strong empirical performance for independent classification problems. However, it has never been used to train CRF. Yet it benefits from an exact line search with a single marginalization oracle call, unlike previous approaches. In this paper, we adapt SDCA to train CRF and we enhance it with an adaptive non-uniform sampling strategy. Our preliminary experiments suggest that this method matches state-of-the-art CRF optimization techniques.
Query-limited Black-box Attacks to Classifiers
Suya, Fnu, Tian, Yuan, Evans, David, Papotti, Paolo
We study black-box attacks on machine learning classifiers where each query to the model incurs some cost or risk of detection to the adversary. We focus explicitly on minimizing the number of queries as a major objective. Specifically, we consider the problem of attacking machine learning classifiers subject to a budget of feature modification cost while minimizing the number of queries, where each query returns only a class and confidence score. We describe an approach that uses Bayesian optimization to minimize the number of queries, and find that the number of queries can be reduced to approximately one tenth of the number needed through a random strategy for scenarios where the feature modification cost budget is low.
Non-convex Optimization for Machine Learning
Jain, Prateek, Kar, Purushottam
A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround to this has been to relax non-convex problems to convex ones and use traditional methods to solve the (convex) relaxed optimization problems. However, this approach may be lossy and nevertheless presents significant challenges for large-scale optimization. On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques; popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties. This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics. The monograph will lead the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is both to introduce the rich literature in this area and to equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.
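As an illustration of projected gradient descent under a sparsity constraint, one of the heuristics named in this abstract, here is a minimal pure-Python sketch for sparse least squares. The function names, the fixed step size, and the iteration count are assumptions chosen for illustration; the monograph itself should be consulted for the conditions under which such a scheme provably converges.

```python
def hard_threshold(x, s):
    """Project x onto vectors with at most s non-zero entries."""
    # Keep the s entries of largest magnitude and zero out the rest.
    keep = set(sorted(range(len(x)), key=lambda i: -abs(x[i]))[:s])
    return [x[i] if i in keep else 0.0 for i in range(len(x))]


def projected_gradient_descent(A, y, s, step, iters=100):
    """Approximately minimize 0.5 * ||A x - y||^2 with at most s non-zeros in x."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Gradient of the least-squares objective: A^T (A x - y).
        residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i]
                    for i in range(m)]
        grad = [sum(A[i][j] * residual[i] for i in range(m))
                for j in range(n)]
        # Gradient step followed by projection onto the non-convex sparse set.
        x = hard_threshold([x[j] - step * grad[j] for j in range(n)], s)
    return x


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x_hat = projected_gradient_descent(identity, [3.0, -0.5, 2.0, 0.1], s=2, step=1.0)
```

With the identity design used in the usage lines, the iteration simply keeps the two largest-magnitude entries of `y`. The projection step is cheap even though the constraint set is non-convex, which is part of the practical appeal of such direct methods.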
Statistical Inference for the Population Landscape via Moment Adjusted Stochastic Gradients
Modern statistical inference tasks often require iterative optimization methods to approximate the solution. Convergence analysis from optimization only tells us how well we are approximating the solution deterministically, but overlooks the sampling nature of the data. However, due to the randomness in the data, statisticians are keen to provide uncertainty quantification, or confidence, for the answer obtained after a certain number of optimization steps. Therefore, it is important yet challenging to understand the sampling distribution of the iterative optimization methods. This paper makes some progress in this direction by introducing a new stochastic optimization method for statistical inference, the moment adjusted stochastic gradient descent. We establish non-asymptotic theory that characterizes the statistical distribution of the iterative methods, with good optimization guarantees. On the statistical front, the theory allows for model misspecification, with very mild conditions on the data. For optimization, the theory is flexible for both the convex and non-convex cases. Remarkably, the moment adjusting idea, motivated by "error standardization" in statistics, achieves an effect similar to Nesterov's acceleration in optimization for certain convex problems, such as fitting generalized linear models. We also demonstrate this acceleration effect in the non-convex setting through experiments.
Adversarial Structured Prediction for Multivariate Measures
Wang, Hong, Rezaei, Ashkan, Ziebart, Brian D.
Many predicted structured objects (e.g., sequences, matchings, trees) are evaluated using the F-score, alignment error rate (AER), or other multivariate performance measures. Since inductively optimizing these measures using training data is typically computationally difficult, empirical risk minimization of surrogate losses is employed, using, e.g., the hinge loss for (structured) support vector machines. These approximations often introduce a mismatch between the learner's objective and the desired application performance, leading to inconsistency. We take a different approach: adversarially approximate training data while optimizing the exact F-score or AER. Structured predictions under this formulation result from solving zero-sum games between a predictor seeking the best performance and an adversary seeking the worst while required to (approximately) match certain structured properties of the training data. We explore this approach for word alignment (AER evaluation) and named entity recognition (F-score evaluation) with linear-chain constraints.
Learning with Average Top-k Loss
Fan, Yanbo, Lyu, Siwei, Ying, Yiming, Hu, Bao-Gang
In this work, we introduce the average top-$k$ (AT$_k$) loss as a new aggregate loss for supervised learning, which is the average over the $k$ largest individual losses over a training dataset. We show that the AT$_k$ loss is a natural generalization of the two widely used aggregate losses, namely the average loss and the maximum loss, but can combine their advantages and mitigate their drawbacks to better adapt to different data distributions. Furthermore, it remains a convex function over all individual losses, which can lead to convex optimization problems that can be solved effectively with conventional gradient-based methods. We provide an intuitive interpretation of the AT$_k$ loss based on its equivalent effect on the continuous individual loss functions, suggesting that it can reduce the penalty on correctly classified data. We further give a learning theory analysis of MAT$_k$ learning on the classification calibration of the AT$_k$ loss and the error bounds of AT$_k$-SVM. We demonstrate the applicability of minimum average top-$k$ learning for binary classification and regression using synthetic and real datasets.
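The aggregate loss defined in this abstract is simple enough to state directly in code. The following Python sketch computes the average of the $k$ largest individual losses; the function name `average_top_k_loss` is chosen here for illustration and does not come from the paper.

```python
def average_top_k_loss(losses, k):
    """Average of the k largest individual losses."""
    if not 1 <= k <= len(losses):
        raise ValueError("k must be between 1 and the number of losses")
    # Sort in descending order and average the first k entries.
    return sum(sorted(losses, reverse=True)[:k]) / k


losses = [0.1, 2.0, 0.5, 1.5]
max_loss = average_top_k_loss(losses, 1)            # k = 1: the maximum loss
avg_loss = average_top_k_loss(losses, len(losses))  # k = n: the average loss
top2_loss = average_top_k_loss(losses, 2)           # in between: mean of 2.0 and 1.5
```

Setting $k = 1$ recovers the maximum loss and $k = n$ recovers the average loss, which is the sense in which this loss generalizes both.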
On Computationally Tractable Selection of Experiments in Measurement-Constrained Regression Models
Wang, Yining, Yu, Adams Wei, Singh, Aarti
We derive computationally tractable methods to select a small subset of experiment settings from a large pool of given design points. The primary focus is on linear regression models, while the technique extends to generalized linear models and Delta's method (estimating functions of linear regression models) as well. The algorithms are based on a continuous relaxation of an otherwise intractable combinatorial optimization problem, with sampling or greedy procedures as post-processing steps. Formal approximation guarantees are established for both algorithms, and numerical results on both synthetic and real-world data confirm the effectiveness of the proposed methods.