Optimization
An interior-point stochastic approximation method and an L1-regularized delta rule
Carbonetto, Peter, Schmidt, Mark, Freitas, Nando D.
The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning. Despite its far-reaching application, there is almost no work on applying stochastic approximation to learning problems with constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization.
Active Instance Sampling via Matrix Partition
Recently, batch-mode active learning has attracted a lot of attention. In this paper, we propose a novel batch-mode active learning approach that selects a batch of queries in each iteration by maximizing a natural form of mutual information criterion between the labeled and unlabeled instances. By employing a Gaussian process framework, this mutual information based instance selection problem can be formulated as a matrix partition problem. Although the matrix partition is an NP-hard combinatorial optimization problem, we show a good local solution can be obtained by exploiting an effective local optimization technique on the relaxed continuous optimization problem. The proposed active learning approach is independent of employed classification models.
Fast global convergence rates of gradient methods for high-dimensional statistical recovery
Agarwal, Alekh, Negahban, Sahand, Wainwright, Martin J.
Many statistical $M$-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension $d$ to grow with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov's first-order method \cite{Nesterov07} has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter $\theta *$ and the optimal solution $\widehat{\theta}$.
Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
Luo, Dijun, Huang, Heng, Nie, Feiping, Ding, Chris H.
In many graph-based machine learning and data mining approaches, the quality of the graph is critical. However, in real-world applications, especially in semi-supervised learning and unsupervised learning, the evaluation of the quality of a graph is often expensive and sometimes even impossible, due the cost or the unavailability of ground truth. In this paper, we proposed a robust approach with convex optimization to forge'' a graph: with an input of a graph, to learn a graph with higher quality. Our major concern is that an ideal graph shall satisfy all the following constraints: non-negative, symmetric, low rank, and positive semidefinite. We develop a graph learning algorithm by solving a convex optimization problem and further develop an efficient optimization to obtain global optimal solutions with theoretical guarantees.
Multi-Task Multicriteria Hyperparameter Optimization
Akhmetzyanov, Kirill, Yuzhakov, Alexander
We present a new method for searching optimal hyperparameters among several tasks and several criteria. Multi-Task Multi Criteria method (MTMC) provides several Pareto-optimal solutions, among which one solution is selected with given criteria significance coefficients. The article begins with a mathematical formulation of the problem of choosing optimal hyperparameters. Then, the steps of the MTMC method that solves this problem are described. The proposed method is evaluated on the image classification problem using a convolutional neural network. The article presents optimal hyperparameters for various criteria significance coefficients.
A Random-Feature Based Newton Method for Empirical Risk Minimization in Reproducing Kernel Hilbert Space
Chang, Ting-Jui, Shahrampour, Shahin
In supervised learning using kernel methods, we encounter a large-scale finite-sum minimization over a reproducing kernel Hilbert space(RKHS). Often times large-scale finite-sum problems can be solved using efficient variants of Newton's method where the Hessian is approximated via sub-samples. In RKHS, however, the dependence of the penalty function to kernel makes standard sub-sampling approaches inapplicable, since the gram matrix is not readily available in a low-rank form. In this paper, we observe that for this class of problems, one can naturally use kernel approximation to speed up the Newton's method. Focusing on randomized features for kernel approximation, we provide a novel second-order algorithm that enjoys local superlinear convergence and global convergence in the high probability sense. The key to our analysis is showing that the approximated Hessian via random features preserves the spectrum of the original Hessian. We provide numerical experiments verifying the efficiency of our approach, compared to variants of sub-sampling methods.
Nearest Neighbor based Greedy Coordinate Descent
Dhillon, Inderjit S., Ravikumar, Pradeep K., Tewari, Ambuj
Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse.
Approximating Concavely Parameterized Optimization Problems
Giesen, Joachim, Mueller, Jens, Laue, Soeren, Swiercy, Sascha
We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy $\varepsilon 0$ by a set of size $O(1/\sqrt{\varepsilon})$. A lower bound of size $\Omega (1/\sqrt{\varepsilon})$ shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size $O(1/\sqrt{\varepsilon})$. Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. Papers published at the Neural Information Processing Systems Conference.
Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Agarwal, Alekh, Negahban, Sahand, Wainwright, Martin J.
We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a $\order(\pdim/T)$ convergence rate for strongly convex objectives in $\pdim$ dimensions and $\order(\sqrt{\spindex( \log\pdim)/T})$ convergence rate when the optimum is $\spindex$-sparse. Our algorithm is based on successively solving a series of $\ell_1$-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after $T$ iterations is at most $\order(\spindex(\log\pdim)/T)$, with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses.
Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Wibisono, Andre, Wainwright, Martin J., Jordan, Michael I., Duchi, John C.
We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{\dim}$ in convergence rate over traditional stochastic gradient methods, where $\dim$ is the dimension of the problem. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors. Papers published at the Neural Information Processing Systems Conference.