Optimization
Spectral Non-Convex Optimization for Dimension Reduction with Hilbert-Schmidt Independence Criterion
Wu, Chieh, Miller, Jared, Chang, Yale, Sznaier, Mario, Dy, Jennifer
The Hilbert Schmidt Independence Criterion (HSIC) is a kernel dependence measure that has applications in various aspects of machine learning. Conveniently, the objectives of different dimensionality reduction applications using HSIC often reduce to the same optimization problem. However, the nonconvexity of the objective function arising from non-linear kernels poses a serious challenge to optimization efficiency and limits the potential of HSIC-based formulations. As a result, only linear kernels have been computationally tractable in practice. This paper proposes a spectral-based optimization algorithm that extends beyond the linear kernel. The algorithm identifies a family of suitable kernels and provides the first and second-order local guarantees when a fixed point is reached. Furthermore, we propose a principled initialization strategy, thereby removing the need to repeat the algorithm at random initialization points. Compared to state-of-the-art optimization algorithms, our empirical results on real data show a run-time improvement by as much as a factor of $10^5$ while consistently achieving lower cost and classification/clustering errors. The implementation source code is publicly available on https://github.com/endsley.
Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs
Shani, Lior, Efroni, Yonathan, Mannor, Shie
Trust region policy optimization (TRPO) is a popular and empirically successful policy search algorithm in Reinforcement Learning (RL) in which a surrogate problem, that restricts consecutive policies to be `close' to one another, is iteratively solved. Nevertheless, TRPO has been considered a heuristic algorithm inspired by Conservative Policy Iteration (CPI). We show that the adaptive scaling mechanism used in TRPO is in fact the natural "RL version" of traditional trust-region methods from convex analysis. We first analyze TRPO in the planning setting, in which we have access to the model and the entire state space. Then, we consider sample-based TRPO and establish $\tilde O(1/\sqrt{N})$ convergence rate to the global optimum. Importantly, the adaptive scaling mechanism allows us to analyze TRPO in {\em regularized MDPs} for which we prove fast rates of $\tilde O(1/N)$, much like results in convex optimization. This is the first result in RL of better rates when regularizing the instantaneous cost or reward.
Permutation Recovery from Multiple Measurement Vectors in Unlabeled Sensing
Zhang, Hang, Slawski, Martin, Li, Ping
In "Unlabeled Sensing", one observes a set of linear measurements of an underlying signal with incomplete or missing information about their ordering, which can be modeled in terms of an unknown permutation. Previous work on the case of a single noisy measurement vector has exposed two main challenges: 1) a high requirement concerning the \emph{signal-to-noise ratio} (snr), i.e., approximately of the order of $n^{5}$, and 2) a massive computational burden in light of NP-hardness in general. In this paper, we study the case of \emph{multiple} noisy measurement vectors (MMVs) resulting from a \emph{common} permutation and investigate to what extent the number of MMVs $m$ facilitates permutation recovery by "borrowing strength". The above two challenges have at least partially been resolved within our work. First, we show that a large stable rank of the signal significantly reduces the required snr which can drop from a polynomial in $n$ for $m = 1$ to a constant for $m = \Omega(\log n)$, where $m$ denotes the number of MMVs and $n$ denotes the number of measurements per MV. This bound is shown to be sharp and is associated with a phase transition phenomenon. Second, we propose computational schemes for recovering the unknown permutation in practice. For the "oracle case" with the known signal, the maximum likelihood (ML) estimator reduces to a linear assignment problem whose global optimum can be obtained efficiently. For the case in which both the signal and permutation are unknown, the problem is reformulated as a bi-convex optimization problem with an auxiliary variable, which can be solved by the Alternating Direction Method of Multipliers (ADMM). Numerical experiments based on the proposed computational schemes confirm the tightness of our theoretical analysis.
LSMI-Sinkhorn: Semi-supervised Squared-Loss Mutual Information Estimation with Optimal Transport
Liu, Yanbin, Yamada, Makoto, Tsai, Yao-Hung Hubert, Le, Tam, Salakhutdinov, Ruslan, Yang, Yi
Estimating mutual information is an important machine learning and statistics problem. To estimate the mutual information from data, a common practice is preparing a set of paired samples. However, in some cases, it is difficult to obtain a large number of data pairs. To address this problem, we propose squared-loss mutual information (SMI) estimation using a small number of paired samples and the available unpaired ones. We first represent SMI through the density ratio function, where the expectation is approximated by the samples from marginals and its assignment parameters. The objective is formulated using the optimal transport problem and quadratic programming. Then, we introduce the least-square mutual information-Sinkhorn algorithm (LSMI-Sinkhorn) for efficient optimization. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. We also evaluate and show the effectiveness of the proposed LSMI-Sinkhorn on various types of machine learning problems such as image matching and photo album summarization.
A Nonconvex Approach for Exact and Efficient Multichannel Sparse Blind Deconvolution
Qu, Qing, Li, Xiao, Zhu, Zhihui
We study the multi-channel sparse blind deconvolution (MCS-BD) problem, whose task is to simultaneously recover a kernel $\mathbf a$ and multiple sparse inputs $\{\mathbf x_i\}_{i=1}^p$ from their circulant convolution $\mathbf y_i = \mathbf a \circledast \mathbf x_i $ ($i=1,\cdots,p$). We formulate the task as a nonconvex optimization problem over the sphere. Under mild statistical assumptions of the data, we prove that the vanilla Riemannian gradient descent (RGD) method, with random initializations, provably recovers both the kernel $\mathbf a$ and the signals $\{\mathbf x_i\}_{i=1}^p$ up to a signed shift ambiguity. In comparison with state-of-the-art results, our work shows significant improvements in terms of sample complexity and computational efficiency. Our theoretical results are corroborated by numerical experiments, which demonstrate superior performance of the proposed approach over the previous methods on both synthetic and real datasets.
Gradient Descent based Weight Learning for Grouping Problems: Application on Graph Coloring and Equitable Graph Coloring
Goudet, Olivier, Duval, Bรฉatrice, Hao, Jin-Kao
A grouping problem involves partitioning a set of items into mutually disjoint groups or clusters according to some guiding decision criteria and imperative constraints. Grouping problems have many relevant applications and are computationally difficult. In this work, we present a general weight learning based optimization framework for solving grouping problems. The central idea of our approach is to formulate the task of seeking a solution as a real-valued weight matrix learning problem that is solved by first order gradient descent. A practical implementation of this framework is proposed with tensor calculus in order to benefit from parallel computing on GPU devices. To show its potential for tackling difficult problems, we apply the approach to two typical and well-known grouping problems (graph coloring and equitable graph coloring). We present large computational experiments and comparisons on popular benchmarks and report improved best-known results (new upper bounds) for several large graphs.
Quantum Natural Gradient
Stokes, James, Izaac, Josh, Killoran, Nathan, Carleo, Giuseppe
Variational optimization of parametrized quantum circuits is an integral component for many hybrid quantum-classical algorithms, which are arguably the most promising applications of Noisy Intermediate-Scale Quantum (NISQ) computers [1]. Applications include the Variational Quantum Eigensolver (VQE) [2], Quantum Approximate Optimization Algorithm (QAOA) [3] and Quantum Neural Networks (QNNs) [4-6]. All the above are examples of stochastic optimization problems whereby one minimizes the expected value of a random cost function over a set of variational parameters, using noisy estimates of the cost and/or its gradient. In the quantum setting these estimates are obtained by repeated measurements of some Hermitian observables for a quantum state which depends on the variational parameters. A variety of optimization methods have been proposed in the variational quantum circuit literature for determining optimal variational parameters, including derivative-free (zeroth-order) methods such as Nelder-Mead, finite-differencing [7] or SPSA [8].
Distributionally Robust Language Modeling
Oren, Yonatan, Sagawa, Shiori, Hashimoto, Tatsunori B., Liang, Percy
Language models are generally trained on data spanning a wide range of topics (e.g., news, reviews, fiction), but they might be applied to an a priori unknown target distribution (e.g., restaurant reviews). In this paper, we first show that training on text outside the test distribution can degrade test performance when using standard maximum likelihood (MLE) training. To remedy this without the knowledge of the test distribution, we propose an approach which trains a model that performs well over a wide range of potential test distributions. In particular, we derive a new distributionally robust optimization (DRO) procedure which minimizes the loss of the model over the worst-case mixture of topics with sufficient overlap with the training distribution. Our approach, called topic conditional value at risk (topic CVaR), obtains a 5.5 point perplexity reduction over MLE when the language models are trained on a mixture of Yelp reviews and news and tested only on reviews.
Quasi-Newton Optimization Methods For Deep Learning Applications
Rafati, Jacob, Marcia, Roummel F.
Deep learning algorithms often require solving a highly non-linear and nonconvex unconstrained optimization problem. Methods for solving optimization problems in large-scale machine learning, such as deep learning and deep reinforcement learning (RL), are generally restricted to the class of first-order algorithms, like stochastic gradient descent (SGD). While SGD iterates are inexpensive to compute, they have slow theoretical convergence rates. Furthermore, they require exhaustive trial-and-error to fine-tune many learning parameters. Using second-order curvature information to find search directions can help with more robust convergence for non-convex optimization problems. However, computing Hessian matrices for large-scale problems is not computationally practical. Alternatively, quasi-Newton methods construct an approximate of the Hessian matrix to build a quadratic model of the objective function. Quasi-Newton methods, like SGD, require only first-order gradient information, but they can result in superlinear convergence, which makes them attractive alternatives to SGD. The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) approach is one of the most popular quasi-Newton methods that construct positive definite Hessian approximations. In this chapter, we propose efficient optimization methods based on L-BFGS quasi-Newton methods using line search and trust-region strategies. Our methods bridge the disparity between first- and second-order methods by using gradient information to calculate low-rank updates to Hessian approximations. We provide formal convergence analysis of these methods as well as empirical results on deep learning applications, such as image classification tasks and deep reinforcement learning on a set of ATARI 2600 video games. Our results show a robust convergence with preferred generalization characteristics as well as fast training time.
Xeggora: Exploiting Immune-to-Evidence Symmetries with Full Aggregation in Statistical Relational Models
Amirian, Mohammad Mahdi, Shiry Ghidary, Saeed
We present improvements in maximum a-posteriori inference for Markov Logic, a widely used SRL formalism. Inferring the most probable world for Markov Logic is NP-hard in general. Several approaches, including Cutting Plane Aggregation (CPA), perform inference through translation to Integer Linear Programs. Aggregation exploits context-specific symmetries independently of evidence and reduces the size of the program. We illustrate much more symmetries occurring in long ground clauses that are ignored by CPA and can be exploited by higher-order aggregations. We propose Full-Constraint-Aggregation, a superior algorithm to CPA which exploits the ignored symmetries via a lifted translation method and some constraint relaxations. RDBMS and heuristic techniques are involved to improve the overall performance. We introduce Xeggora as an evolutionary extension of RockIt, the query engine that uses CPA. Xeggora evaluation on real-world benchmarks shows progress in efficiency compared to RockIt especially for models with long formulas.