Optimization
Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering
Lacoste-Julien, Simon, Lindsten, Fredrik, Bach, Francis
Recently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and "kernel herding" was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indicate indeed an improvement of accuracy over random and quasi-Monte Carlo sampling.
From Predictive to Prescriptive Analytics
Bertsimas, Dimitris, Kallus, Nathan
In this paper, we combine ideas from machine learning (ML) and operations research and management science (OR/MS) in developing a framework, along with specific methods, for using data to prescribe decisions in OR/MS problems. In a departure from other work on data-driven optimization and reflecting our practical experience with the data available in applications of OR/MS, we consider data consisting, not only of observations of quantities with direct effect on costs/revenues, such as demand or returns, but predominantly of observations of associated auxiliary quantities. The main problem of interest is a conditional stochastic optimization problem, given imperfect observations, where the joint probability distributions that specify the problem are unknown. We demonstrate that our proposed solution methods are generally applicable to a wide range of decision problems. We prove that they are computationally tractable and asymptotically optimal under mild conditions even when data is not independent and identically distributed (iid) and even for censored observations. As an analogue to the coefficient of determination $R^2$, we develop a metric $P$ termed the coefficient of prescriptiveness to measure the prescriptive content of data and the efficacy of a policy from an operations perspective. To demonstrate the power of our approach in a real-world setting we study an inventory management problem faced by the distribution arm of an international media conglomerate, which ships an average of 1 billion units per year. We leverage both internal data and public online data harvested from IMDb, Rotten Tomatoes, and Google to prescribe operational decisions that outperform baseline measures. Specifically, the data we collect, leveraged by our methods, accounts for an 88% improvement as measured by our coefficient of prescriptiveness.
A Multistage Stochastic Programming Approach to the Dynamic and Stochastic VRPTW - Extended version
Saint-Guillain, Michael, Deville, Yves, Solnon, Christine
We consider a dynamic vehicle routing problem with time windows and stochastic customers (DS-VRPTW), such that customers may request for services as vehicles have already started their tours. To solve this problem, the goal is to provide a decision rule for choosing, at each time step, the next action to perform in light of known requests and probabilistic knowledge on requests likelihood. We introduce a new decision rule, called Global Stochastic Assessment (GSA) rule for the DS-VRPTW, and we compare it with existing decision rules, such as MSA. In particular, we show that GSA fully integrates nonanticipativity constraints so that it leads to better decisions in our stochastic context. We describe a new heuristic approach for efficiently approximating our GSA rule. We introduce a new waiting strategy. Experiments on dynamic and stochastic benchmarks, which include instances of different degrees of dynamism, show that not only our approach is competitive with state-of-the-art methods, but also enables to compute meaningful offline solutions to fully dynamic problems where absolutely no a priori customer request is provided.
A PARTAN-Accelerated Frank-Wolfe Algorithm for Large-Scale SVM Classification
Frandi, Emanuele, Nanculef, Ricardo, Suykens, Johan A. K.
The Frank-Wolfe algorithm (hereafter FW) is a classical method for convex optimization that has seen a substantial revival in interest from researchers [1, 2, 3]. Recent results have shown that the family of FW algorithms enjoys powerful theoretical properties such as iteration complexity bounds that are independent of the problem size, provable primal-dual convergence rates, and sparsity guarantees that hold during the whole execution of the algorithm [4, 2]. Furthermore, several variants of the basic procedure exist which can improve the convergence rate and practical performance of the basic FW iteration [5, 6, 7, 8]. Finally, the fact that FW methods work with projection-free iterations is an essential advantage in applications such as matrix recovery, where a projection step (as needed, e.g., by proximal methods) has a super-linear complexity [2, 9]. As a result, FW is now considered a suitable choice for large-scale optimization 1 problems arising in several contexts such as Machine Learning, statistics, bioinformatics and other fields [10, 11, 12].
A scaled gradient projection method for Bayesian learning in dynamical systems
Bonettini, Silvia, Chiuso, Alessandro, Prato, Marco
A crucial task in system identification problems is the selection of the most appropriate model class, and is classically addressed resorting to cross-validation or using order selection criteria based on asymptotic arguments. As recently suggested in the literature, this can be addressed in a Bayesian framework, where model complexity is regulated by few hyperparameters, which can be estimated via marginal likelihood maximization. It is thus of primary importance to design effective optimization methods to solve the corresponding optimization problem. If the unknown impulse response is modeled as a Gaussian process with a suitable kernel, the maximization of the marginal likelihood leads to a challenging nonconvex optimization problem, which requires a stable and effective solution strategy. In this paper we address this problem by means of a scaled gradient projection algorithm, in which the scaling matrix and the steplength parameter play a crucial role to provide a meaningful solution in a computational time comparable with second order methods. In particular, we propose both a generalization of the split gradient approach to design the scaling matrix in the presence of box constraints, and an effective implementation of the gradient and objective function. The extensive numerical experiments carried out on several test problems show that our method is very effective in providing in few tenths of a second solutions of the problems with accuracy comparable with state-of-the-art approaches. Moreover, the flexibility of the proposed strategy makes it easily adaptable to a wider range of problems arising in different areas of machine learning, signal processing and system identification.
Feature selection for classification with class-separability strategy and data envelopment analysis
Zhang, Yishi, Yang, Chao, Yang, Anrong, Xiong, Chan, Zhou, Xingchi, Zhang, Zigang
In this paper, a novel feature selection method is presented, which is based on Class-Separability (CS) strategy and Data Envelopment Analysis (DEA). To better capture the relationship between features and the class, class labels are separated into individual variables and relevance and redundancy are explicitly handled on each class label. Super-efficiency DEA is employed to evaluate and rank features via their conditional dependence scores on all class labels, and the feature with maximum super-efficiency score is then added in the conditioning set for conditional dependence estimation in the next iteration, in such a way as to iteratively select features and get the final selected features. Eventually, experiments are conducted to evaluate the effectiveness of proposed method comparing with four state-of-the-art methods from the viewpoint of classification accuracy. Empirical results verify the feasibility and the superiority of proposed feature selection method. Keywords: Feature selection, classification, class-separability strategy, data envelopment analysis, super-efficiency 1. Introduction The explosion of large datasets in many fields poses unprecedented challenges to pattern recognition and data mining. Not only is the scale of samples getting larger, but also new types of data become prevalent. For example, tremendous new computer and Internet applications generate large amounts of types of data at an exponential rate in the world. It is thus realized that feature selection is an indispensable component [1]. Feature selection is a process of selecting a subset of original features according to certain criteria. It is an important and frequently used technique for dimension reduction.
Microscopic Advances with Large-Scale Learning: Stochastic Optimization for Cryo-EM
Punjani, Ali, Brubaker, Marcus A.
Determining the 3D structures of biological molecules is a key problem for both biology and medicine. Electron Cryomicroscopy (Cryo-EM) is a promising technique for structure estimation which relies heavily on computational methods to reconstruct 3D structures from 2D images. This paper introduces the challenging Cryo-EM density estimation problem as a novel application for stochastic optimization techniques. Structure discovery is formulated as MAP estimation in a probabilistic latent-variable model, resulting in an optimization problem to which an array of seven stochastic optimization methods are applied. The methods are tested on both real and synthetic data, with some methods recovering reasonable structures in less than one epoch from a random initialization. Complex quasi-Newton methods are found to converge more slowly than simple gradient-based methods, but all stochastic methods are found to converge to similar optima. This method represents a major improvement over existing methods as it is significantly faster and is able to converge from a random initialization.
Optimal computational and statistical rates of convergence for sparse nonconvex learning problems
Wang, Zhaoran, Liu, Han, Zhang, Tong
We provide theoretical analysis of the statistical and computational properties of penalized $M$-estimators that can be formulated as the solution to a possibly nonconvex optimization problem. Many important estimators fall in this category, including least squares regression with nonconvex regularization, generalized linear models with nonconvex regularization and sparse elliptical random design regression. For these problems, it is intractable to calculate the global solution due to the nonconvex formulation. In this paper, we propose an approximate regularization path-following method for solving a variety of learning problems with nonconvex objective functions. Under a unified analytic framework, we simultaneously provide explicit statistical and computational rates of convergence for any local solution attained by the algorithm. Computationally, our algorithm attains a global geometric rate of convergence for calculating the full regularization path, which is optimal among all first-order algorithms. Unlike most existing methods that only attain geometric rates of convergence for one single regularization parameter, our algorithm calculates the full regularization path with the same iteration complexity. In particular, we provide a refined iteration complexity bound to sharply characterize the performance of each stage along the regularization path. Statistically, we provide sharp sample complexity analysis for all the approximate local solutions along the regularization path. In particular, our analysis improves upon existing results by providing a more refined sample complexity bound as well as an exact support recovery result for the final estimator. These results show that the final estimator attains an oracle statistical property due to the usage of nonconvex penalty.
A* Sampling
Maddison, Chris J., Tarlow, Daniel, Minka, Tom
The problem of drawing samples from a discrete distribution can be converted into a discrete optimization problem. In this work, we show how sampling from a continuous distribution can be converted into an optimization problem over continuous space. Central to the method is a stochastic process recently described in mathematical statistics that we call the Gumbel process. We present a new construction of the Gumbel process and A* sampling, a practical generic sampling algorithm that searches for the maximum of a Gumbel process using A* search. We analyze the correctness and convergence time of A* sampling and demonstrate empirically that it makes more efficient use of bound and likelihood evaluations than the most closely related adaptive rejection sampling-based algorithms.
* HILL-CLIMBING: SOME REMARKS ON MULTIPLE OPTIMIZATION
Summary If we have a machine with 1000 knobs, how can we set them so as to minimize some output S of the machine, which represents, say, its error or departure from the behavior we want from it? We describe units, driven by S, which will each work a knob so that the whole system will tend toward an optimum. The units can be substantially identical, regardless of the actual structure of the machine. We exhibit three versions of these units, each with virtues and faults, and discuss -their behavior, with concrete and synthetic illustrative experiments. There are divers aspects of their joint behaviors shown, and some caveats about their use, especially in large numbers. We have not finished testing these units in large assemblies, but it is probably unimportant that the component parts of each unit work accurately or reliably.