Goto

Collaborating Authors

 Optimization


Efficiently Bounding Optimal Solutions after Small Data Modification in Large-Scale Empirical Risk Minimization

arXiv.org Machine Learning

We study large-scale classification problems in changing environments where a small part of the dataset is modified, and the effect of the data modification must be quickly incorporated into the classifier. When the entire dataset is large, even if the amount of the data modification is fairly small, the computational cost of re-training the classifier would be prohibitively large. In this paper, we propose a novel method for efficiently incorporating such a data modification effect into the classifier without actually re-training it. The proposed method provides bounds on the unknown optimal classifier with the cost only proportional to the size of the data modification. We demonstrate through numerical experiments that the proposed method provides sufficiently tight bounds with negligible computational costs, especially when a small part of the dataset is modified in a large-scale classification problem.


The local convexity of solving systems of quadratic equations

arXiv.org Machine Learning

This paper considers the recovery of a rank $r$ positive semidefinite matrix $X X^T\in\mathbb{R}^{n\times n}$ from $m$ scalar measurements of the form $y_i := a_i^T X X^T a_i$ (i.e., quadratic measurements of $X$). Such problems arise in a variety of applications, including covariance sketching of high-dimensional data streams, quadratic regression, quantum state tomography, among others. A natural approach to this problem is to minimize the loss function $f(U) = \sum_i (y_i - a_i^TUU^Ta_i)^2$ which has an entire manifold of solutions given by $\{XO\}_{O\in\mathcal{O}_r}$ where $\mathcal{O}_r$ is the orthogonal group of $r\times r$ orthogonal matrices; this is {\it non-convex} in the $n\times r$ matrix $U$, but methods like gradient descent are simple and easy to implement (as compared to semidefinite relaxation approaches). In this paper we show that once we have $m \geq C nr \log^2(n)$ samples from isotropic gaussian $a_i$, with high probability {\em (a)} this function admits a dimension-independent region of {\em local strong convexity} on lines perpendicular to the solution manifold, and {\em (b)} with an additional polynomial factor of $r$ samples, a simple spectral initialization will land within the region of convexity with high probability. Together, this implies that gradient descent with initialization (but no re-sampling) will converge linearly to the correct $X$, up to an orthogonal transformation. We believe that this general technique (local convexity reachable by spectral initialization) should prove applicable to a broader class of nonconvex optimization problems.


Horizontally Scalable Submodular Maximization

arXiv.org Machine Learning

A variety of large-scale machine learning problems can be cast as instances of constrained submodular maximization. Existing approaches for distributed submodular maximization have a critical drawback: The capacity - number of instances that can fit in memory - must grow with the data set size. In practice, while one can provision many machines, the capacity of each machine is limited by physical constraints. We propose a truly scalable approach for distributed submodular maximization under fixed capacity. The proposed framework applies to a broad class of algorithms and constraints and provides theoretical guarantees on the approximation factor for any available capacity. We empirically evaluate the proposed algorithm on a variety of data sets and demonstrate that it achieves performance competitive with the centralized greedy solution.


Machine learning software increases cooling system optimization

@machinelearnbot

SEATTLE, February 11, 2015 โ€“ Optimum Energy, the leading provider of data-driven cooling and heating optimization solutions for enterprise facilities, today introduced OptiCxTM Dynamic Sequencing, a software optimization tool that learns how chillers perform over time in a variety of operating conditions, and uses this data to improve the overall plant efficiency by determining the most efficient chiller to run. "The OptiCx Platform is an award-winning approach with a growing base of committed customers, and now, with Dynamic Sequencing, it's taking a big step forward," said Ian Dempster, Optimum Energy's Senior Director of Product Innovation. "When combined with OptimumLOOP, this is the most powerful chiller optimization solution available, offering substantial reductions in energy and water use." Available as an add-on for customers with a subscription to the OptiCx PlatformTM, Dynamic Sequencing works in conjunction with OptimumLOOPTM, an operational module in the OptiCx Platform. OptimumLOOP uses relational control algorithms to determine operating setpoints and parameters to turn on or off an additional chiller in a plant.


Budgeted Optimization with Constrained Experiments

Journal of Artificial Intelligence Research

Motivated by a real-world problem, we study a novel budgeted optimization problem where the goal is to optimize an unknown function f(.) given a budget by requesting a sequence of samples from the function. In our setting, however, evaluating the function at precisely specified points is not practically possible due to prohibitive costs. Instead, we can only request constrained experiments. A constrained experiment, denoted by Q, specifies a subset of the input space for the experimenter to sample the function from. The outcome of Q includes a sampled experiment x, and its function output f(x). Importantly, as the constraints of Q become looser, the cost of fulfilling the request decreases, but the uncertainty about the location x increases. Our goal is to manage this trade-off by selecting a set of constrained experiments that best optimize f(.) within the budget. We study this problem in two different settings, the non-sequential (or batch) setting where a set of constrained experiments is selected at once, and the sequential setting where experiments are selected one at a time. We evaluate our proposed methods for both settings using synthetic and real functions. The experimental results demonstrate the efficacy of the proposed methods.


Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs

arXiv.org Machine Learning

In this paper, we propose several improvements on the block-coordinate Frank-Wolfe (BCFW) algorithm from Lacoste-Julien et al. (2013) recently used to optimize the structured support vector machine (SSVM) objective in the context of structured prediction, though it has wider applications. The key intuition behind our improvements is that the estimates of block gaps maintained by BCFW reveal the block suboptimality that can be used as an adaptive criterion. First, we sample objects at each iteration of BCFW in an adaptive non-uniform way via gapbased sampling. Second, we incorporate pairwise and away-step variants of Frank-Wolfe into the block-coordinate setting. Third, we cache oracle calls with a cache-hit criterion based on the block gaps. Fourth, we provide the first method to compute an approximate regularization path for SSVM. Finally, we provide an exhaustive empirical evaluation of all our methods on four structured prediction datasets.


Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian

arXiv.org Machine Learning

An augmented Lagrangian (AL) can convert a constrained optimization problem into a sequence of simpler (e.g., unconstrained) problems, which are then usually solved with local solvers. Recently, surrogate-based Bayesian optimization (BO) sub-solvers have been successfully deployed in the AL framework for a more global search in the presence of inequality constraints; however, a drawback was that expected improvement (EI) evaluations relied on Monte Carlo. Here we introduce an alternative slack variable AL, and show that in this formulation the EI may be evaluated with library routines. The slack variables furthermore facilitate equality as well as inequality constraints, and mixtures thereof. We show how our new slack "ALBO" compares favorably to the original. Its superiority over conventional alternatives is reinforced on several mixed constraint examples.


Stochastic Variance Reduced Riemannian Eigensolver

arXiv.org Machine Learning

We study the stochastic Riemannian gradient algorithm for matrix eigen-decomposition. The state-of-the-art stochastic Riemannian algorithm requires the learning rate to decay to zero and thus suffers from slow convergence and sub-optimal solutions. In this paper, we address this issue by deploying the variance reduction (VR) technique of stochastic gradient descent (SGD). The technique was originally developed to solve convex problems in the Euclidean space. We generalize it to Riemannian manifolds and realize it to solve the non-convex eigen-decomposition problem. We are the first to propose and analyze the generalization of SVRG to Riemannian manifolds. Specifically, we propose the general variance reduction form, SVRRG, in the framework of the stochastic Riemannian gradient optimization. It's then specialized to the problem with eigensolvers and induces the SVRRG-EIGS algorithm. We provide a novel and elegant theoretical analysis on this algorithm. The theory shows that a fixed learning rate can be used in the Riemannian setting with an exponential global convergence rate guaranteed. The theoretical results make a significant improvement over existing studies, with the effectiveness empirically verified.


Structured Sparse Regression via Greedy Hard-Thresholding

arXiv.org Machine Learning

High dimensional problems where the regressor belongs to a small number of groups play a critical role in many machine learning and signal processing applications, such as computational biology [9] and multitask learning [14]. In most of these cases, the groups overlap, i.e., the same feature can belong to multiple groups. For example, gene pathways overlap in computational biology applications, and parent-child pairs of wavelet transform coefficients overlap in signal processing applications. The existing state-of-the-art methods for solving such group sparsity structured regression problems can be categorized into two broad classes: a) convex relaxation based methods, b) iterative hard thresholding (IHT) or greedy methods. In practice, IHT methods tend to be significantly more scalable than the (group-)lasso style methods that solve a convex program. But, these methods require a certain projection operator which in general is NPhard to compute and often certain simple heuristics are used with relatively weak theoretical guarantees. Moreover, existing guarantees for both classes of methods require relatively restrictive assumptions on the data, like Restricted Isometry Property or variants thereof [2, 8, 18, 14], that are unlikely to hold in most common applications. In fact, even under such settings, the group sparsity based convex programs offer at most polylogarithmic gains over standard sparsity based methods [18].


Global Optimality of Local Search for Low Rank Matrix Recovery

arXiv.org Machine Learning

We show that there are no spurious local minima in the non-convex factorized parametrization of low-rank matrix recovery from incoherent linear measurements. With noisy measurements we show all local minima are very close to a global optimum. Together with a curvature bound at saddle points, this yields a polynomial time global convergence guarantee for stochastic gradient descent {\em from random initialization}.