Goto

Collaborating Authors

 Optimization


Efficient Proximal Mapping Computation for Unitarily Invariant Low-Rank Inducing Norms

arXiv.org Machine Learning

Low-rank inducing unitarily invariant norms have been introduced to convexify problems with low-rank/sparsity constraint. They are the convex envelope of a unitary invariant norm and the indicator function of an upper bounding rank constraint. The most well-known member of this family is the so-called nuclear norm. To solve optimization problems involving such norms with proximal splitting methods, efficient ways of evaluating the proximal mapping of the low-rank inducing norms are needed. This is known for the nuclear norm, but not for most other members of the low-rank inducing family. This work supplies a framework that reduces the proximal mapping evaluation into a nested binary search, in which each iteration requires the solution of a much simpler problem. This simpler problem can often be solved analytically as it is demonstrated for the so-called low-rank inducing Frobenius and spectral norms. Moreover, the framework allows to compute the proximal mapping of compositions of these norms with increasing convex functions and the projections onto their epigraphs. This has the additional advantage that we can also deal with compositions of increasing convex functions and low-rank inducing norms in proximal splitting methods.


CHOPT : Automated Hyperparameter Optimization Framework for Cloud-Based Machine Learning Platforms

arXiv.org Machine Learning

Deep neural networks (DNNs) have become an essential method for solving difficult tasks in computer vision, signal processing, and natural language processing (He et al., 2016; Choi et al., 2018; Han et al., 2017; Van Den Oord et al., 2016; Seo et al., 2016; Vaswani et al., 2017). As the capabilities of deep learning have expanded with more modular architectures and advanced optimization methods, the number of hyperparameters has increased in general. This increase of hyperparameter sizes makes it more difficult for a researcher to optimize a model, wasting a lot of human resources and potentially leading unfair comparisons. This reinforces the importance of efficient automated hyperparameter tuning methods and interfaces. To address this problem, several hyperparameter optimization (HyperOpt) methods have been proposed (Jaderberg et al., 2017; Falkner et al., 2018; Li et al., 2017). These methods have many advantages such as strong final performance, parallelism, early stopping which significantly improve performance in terms of computing resource efficiency and optimization time.


Joint Nonparametric Precision Matrix Estimation with Confounding

arXiv.org Machine Learning

We consider the problem of precision matrix estimation where, due to extraneous confounding of the underlying precision matrix, the data are independent but not identically distributed. While such confounding occurs in many scientific problems, our approach is inspired by recent neuroscientific research suggesting that brain function, as measured using functional magnetic resonance imagine (fMRI), is susceptible to confounding by physiological noise such as breathing and subject motion. Following the scientific motivation, we propose a graphical model, which in turn motivates a joint nonparametric estimator. We provide theoretical guarantees for the consistency and the convergence rate of the proposed estimator. In addition, we demonstrate that the optimization of the proposed estimator can be transformed into a series of linear programming problems, and thus be efficiently solved in parallel. Empirical results are presented using simulated and real brain imaging data, which suggest that our approach improves precision matrix estimation, as compared to baselines, when confounding is present.


Maximizing Monotone DR-submodular Continuous Functions by Derivative-free Optimization

arXiv.org Machine Learning

In this paper, we study the problem of monotone (weakly) DR-submodular continuous maximization. While previous methods require the gradient information of the objective function, we propose a derivative-free algorithm LDGM for the first time. We define $\beta$ and $\alpha$ to characterize how close a function is to continuous DR-submodulr and submodular, respectively. Under a convex polytope constraint, we prove that LDGM can achieve a $(1-e^{-\beta}-\epsilon)$-approximation guarantee after $O(1/\epsilon)$ iterations, which is the same as the best previous gradient-based algorithm. Moreover, in some special cases, a variant of LDGM can achieve a $((\alpha/2)(1-e^{-\alpha})-\epsilon)$-approximation guarantee for (weakly) submodular functions. We also compare LDGM with the gradient-based algorithm Frank-Wolfe under noise, and show that LDGM can be more robust. Empirical results on budget allocation verify the effectiveness of LDGM.


Co-manifold learning with missing data

arXiv.org Machine Learning

Representation learning is typically applied to only one mode of a data matrix, either its rows or columns. Yet in many applications, there is an underlying geometry to both the rows and the columns. We propose utilizing this coupled structure to perform co-manifold learning: uncovering the underlying geometry of both the rows and the columns of a given matrix, where we focus on a missing data setting. Our unsupervised approach consists of three components. We first solve a family of optimization problems to estimate a complete matrix at multiple scales of smoothness. We then use this collection of smooth matrix estimates to compute pairwise distances on the rows and columns based on a new multi-scale metric that implicitly introduces a coupling between the rows and the columns. Finally, we construct row and column representations from these multi-scale metrics. We demonstrate that our approach outperforms competing methods in both data visualization and clustering.


An Optimal Control Approach to Sequential Machine Teaching

arXiv.org Machine Learning

Given a sequential learning algorithm and a target model, sequential machine teaching aims to find the shortest training sequence to drive the learning algorithm to the target model. We present the first principled way to find such shortest training sequences. Our key insight is to formulate sequential machine teaching as a time-optimal control problem. This allows us to solve sequential teaching by leveraging key theoretical and computational tools developed over the past 60 years in the optimal control community. Specifically, we study the Pontryagin Maximum Principle, which yields a necessary condition for optimality of a training sequence. We present analytic, structural, and numerical implications of this approach on a case study with a least-squares loss function and gradient descent learner. We compute optimal training sequences for this problem, and although the sequences seem circuitous, we find that they can vastly outperform the best available heuristics for generating training sequences.


Minimizing Sum of Non-Convex but Piecewise log-Lipschitz Functions using Coresets

arXiv.org Machine Learning

We suggest a new optimization technique for minimizing the sum $\sum_{i=1}^n f_i(x)$ of $n$ non-convex real functions that satisfy a property that we call piecewise log-Lipschitz. This is by forging links between techniques in computational geometry, combinatorics and convex optimization. Example applications include the first constant-factor approximation algorithms whose running-time is polynomial in $n$ for the following fundamental problems: (i) Constrained $\ell_z$ Linear Regression: Given $z>0$, $n$ vectors $p_1,\cdots,p_n$ on the plane, and a vector $b\in\mathbb{R}^n$, compute a unit vector $x$ and a permutation $\pi:[n]\to[n]$ that minimizes $\sum_{i=1}^n |p_ix-b_{\pi(i)}|^z$. (ii) Points-to-Lines alignment: Given $n$ lines $\ell_1,\cdots,\ell_n$ on the plane, compute the matching $\pi:[n]\to[n]$ and alignment (rotation matrix $R$ and a translation vector $t$) that minimize the sum of Euclidean distances \[ \sum_{i=1}^n \mathrm{dist}(Rp_i-t,\ell_{\pi(i)})^z \] between each point to its corresponding line. These problems are open even if $z=1$ and the matching $\pi$ is given. In this case, the running time of our algorithms reduces to $O(n)$ using core-sets that support: streaming, dynamic, and distributed parallel computations (e.g. on the cloud) in poly-logarithmic update time. Generalizations for handling e.g. outliers or pseudo-distances such as $M$-estimators for these problems are also provided. Experimental results show that our provable algorithms improve existing heuristics also in practice. A demonstration in the context of Augmented Reality show how such algorithms may be used in real-time systems.


Learning-Theoretic Foundations of Algorithm Configuration for Combinatorial Partitioning Problems

arXiv.org Artificial Intelligence

Max-cut, clustering, and many other partitioning problems that are of significant importance to machine learning and other scientific fields are NP-hard, a reality that has motivated researchers to develop a wealth of approximation algorithms and heuristics. Although the best algorithm to use typically depends on the specific application domain, a worst-case analysis is often used to compare algorithms. This may be misleading if worst-case instances occur infrequently, and thus there is a demand for optimization methods which return the algorithm configuration best suited for the given application's typical inputs. We address this problem for clustering, max-cut, and other partitioning problems, such as integer quadratic programming, by designing computationally efficient and sample efficient learning algorithms which receive samples from an application-specific distribution over problem instances and learn a partitioning algorithm with high expected performance. Our algorithms learn over common integer quadratic programming and clustering algorithm families: SDP rounding algorithms and agglomerative clustering algorithms with dynamic programming. For our sample complexity analysis, we provide tight bounds on the pseudodimension of these algorithm classes, and show that surprisingly, even for classes of algorithms parameterized by a single parameter, the pseudo-dimension is superconstant. In this way, our work both contributes to the foundations of algorithm configuration and pushes the boundaries of learning theory, since the algorithm classes we analyze consist of multi-stage optimization procedures and are significantly more complex than classes typically studied in learning theory.


Predictor-Corrector Policy Optimization

arXiv.org Machine Learning

We present a predictor-corrector framework, called PicCoLO, that can transform a first-order model-free reinforcement or imitation learning algorithm into a new hybrid method that leverages predictive models to accelerate policy learning. The new "PicCoLOed" algorithm optimizes a policy by recursively repeating two steps: In the Prediction Step, the learner uses a model to predict the unseen future gradient and then applies the predicted estimate to update the policy; in the Correction Step, the learner runs the updated policy in the environment, receives the true gradient, and then corrects the policy using the gradient error. Unlike previous algorithms, PicCoLO corrects for the mistakes of using imperfect predicted gradients and hence does not suffer from model bias. The development of PicCoLO is made possible by a novel reduction from predictable online learning to adversarial online learning, which provides a systematic way to modify existing first-order algorithms to achieve the optimal regret with respect to predictable information. We show, in both theory and simulation, that the convergence rate of several first-order model-free algorithms can be improved by PicCoLO.


Comparing Temporal Graphs Using Dynamic Time Warping

arXiv.org Machine Learning

The connections within many real-world networks change over time. Thus, there has been a recent boom in studying temporal graphs. Recognizing patterns in temporal graphs requires a similarity measure to compare different temporal graphs. To this end, we initiate the study of dynamic time warping (an established concept for mining time series data) on temporal graphs. We propose the dynamic temporal graph warping distance (dtgw) to determine the (dis-)similarity of two temporal graphs. Our novel measure is flexible and can be applied in various application domains. We show that computing the dtgw-distance is a challenging (NP-hard) optimization problem and identify some polynomial-time solvable special cases. Moreover, we develop a quadratic programming formulation and an efficient heuristic. Preliminary experiments indicate that the heuristic performs very well and that our concept yields meaningful results on real-world instances.