 Optimization


Efficient, Certifiably Optimal High-Dimensional Clustering

arXiv.org Machine Learning

We consider SDP relaxation methods for data and variable clustering problems, which have been shown in the literature to have good statistical properties in a variety of settings, but remain intractable to solve in practice. In particular, we propose FORCE, a new algorithm to solve the Peng-Wei $K$-means SDP. Compared to the naive interior point method, our method reduces the computational complexity of solving the SDP from $\tilde{O}(d^7\log\epsilon^{-1})$ to $\tilde{O}(d^{6}K^{-2}\epsilon^{-1})$. Our method combines a primal first-order method with a dual optimality certificate search, which, when successful, allows for early termination of the primal method. We show under certain data generating distributions that, with high probability, FORCE is guaranteed to find the optimal solution to the SDP relaxation and provide a certificate of exact optimality. As verified by our numerical experiments, this allows FORCE to solve the Peng-Wei SDP with dimensions in the hundreds in only tens of seconds. We also consider a variation of the Peng-Wei SDP for the case when $K$ is not known a priori and show that a slight modification of FORCE reduces the computational complexity of solving this problem as well: from $\tilde{O}(d^7\log\epsilon^{-1})$ using a standard SDP solver to $\tilde{O}(d^{4}\epsilon^{-1})$.
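
For orientation, the Peng-Wei $K$-means relaxation is usually stated in the following form. The notation is an assumption, not taken from the abstract: $D$ is the $d \times d$ matrix of pairwise squared distances, $\mathbf{1}$ is the all-ones vector, and $d$ is read as the number of points or variables being clustered. The paper itself may use an equivalent maximization form or different symbols.

$$
\min_{Z \in \mathbb{R}^{d \times d}} \ \langle D, Z \rangle
\quad \text{subject to} \quad
Z \succeq 0, \quad Z \ge 0 \ \text{(entrywise)}, \quad Z\mathbf{1} = \mathbf{1}, \quad \operatorname{tr}(Z) = K .
$$

A feasible $Z$ that happens to be a normalized cluster-membership matrix (block-constant with entries $1/|C_k|$ on each cluster $C_k$) recovers a $K$-means partition exactly, which is why a certificate of optimality for the SDP can also certify the clustering.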


Structured Local Optima in Sparse Blind Deconvolution

arXiv.org Machine Learning

Blind deconvolution is the ubiquitous problem of recovering two unknown signals from their convolution. Unfortunately, this problem is ill-posed in general. This paper focuses on the short-and-sparse blind deconvolution problem, where one unknown signal is short and the other is sparsely and randomly supported. This variant captures the structure of the unknown signals in several important applications. We assume the short signal to have unit $\ell^2$ norm and cast the blind deconvolution problem as a nonconvex optimization problem over the sphere. We demonstrate that (i) in a certain region of the sphere, every local optimum is close to some shift truncation of the ground truth, and (ii) for a generic short signal of length $k$, when the sparsity of the activation signal satisfies $\theta\lesssim k^{-2/3}$ and the number of measurements satisfies $m\gtrsim \mathrm{poly}(k)$, a simple initialization method together with a descent algorithm which escapes strict saddle points recovers a near shift truncation of the ground truth kernel.


Learning convex bounds for linear quadratic control policy synthesis

arXiv.org Machine Learning

Learning to make decisions from observed data in dynamic environments remains a problem of fundamental importance in a number of fields, from artificial intelligence and robotics to medicine and finance. This paper concerns the problem of learning control policies for unknown linear dynamical systems so as to maximize a quadratic reward function. We present a method to optimize the expected value of the reward over the posterior distribution of the unknown system parameters, given data. The algorithm involves sequential convex programming, and enjoys reliable local convergence and robust stability guarantees. Numerical simulations and stabilization of a real-world inverted pendulum are used to demonstrate the approach, with strong performance and robustness properties observed in both.


On Curvature-aided Incremental Aggregated Gradient Methods

arXiv.org Machine Learning

This paper studies an acceleration technique for incremental aggregated gradient methods which exploits curvature information for solving strongly convex finite sum optimization problems. These optimization problems arise in large-scale learning applications relevant to machine learning systems. The proposed methods utilize a novel curvature-aided gradient tracking technique to produce gradient estimates with the aid of Hessian information during computation. We propose and analyze two curvature-aided methods. The first, called the curvature-aided incremental aggregated gradient (CIAG) method, can be developed from the standard gradient method and computes an $\epsilon$-optimal solution using ${\cal O}( \kappa \log ( 1 / \epsilon ) )$ iterations for a small $\epsilon$. The second, called the accelerated CIAG (A-CIAG) method, incorporates Nesterov's acceleration into CIAG and requires ${\cal O}( \sqrt{\kappa} \log ( 1 / \epsilon ) )$ iterations for a small $\epsilon$, where $\kappa$ is the problem's condition number. Importantly, the asymptotic convergence rates above are the same as those of the full gradient and accelerated full gradient methods, respectively, and they are independent of the number of component functions involved. The proposed methods are significantly faster than the state-of-the-art methods, especially for large-scale problems with a massive amount of data. The source code is available at https://github.com/hoitowai/ciag/
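
The idea of a curvature-aided gradient estimate can be illustrated with a small sketch. It is a one-dimensional toy written for this listing, not the authors' code (which lives at the URL above). The assumptions are: component functions of regularized logistic type, a cyclic component order, a step size of $1/L$, and the function names `ciag_1d`, `grad_i`, `hess_i`, all of which are hypothetical. Each stale component gradient is corrected by a first-order Taylor term built from its stored second derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_i(theta, a, lam):
    # Derivative of f_i(theta) = log(1 + exp(-a*theta)) + (lam/2)*theta**2.
    return -a * sigmoid(-a * theta) + lam * theta


def hess_i(theta, a, lam):
    # Second derivative of the same component function.
    s = sigmoid(a * theta)
    return a * a * s * (1.0 - s) + lam


def full_gradient(theta, coeffs, lam):
    return sum(grad_i(theta, a, lam) for a in coeffs)


def ciag_1d(coeffs, lam, steps, theta0=0.0):
    """Toy curvature-aided incremental aggregated gradient loop (assumed form)."""
    m = len(coeffs)
    # Smoothness constant of the sum; 1/L is an assumed, conservative step size.
    step = 1.0 / sum(a * a / 4.0 + lam for a in coeffs)
    theta = theta0
    ref = [theta0] * m  # point at which each component was last evaluated
    # Running sums so that the estimate is b + H*theta, i.e.
    # sum_i [ g_i(ref_i) + h_i(ref_i) * (theta - ref_i) ].
    b = sum(grad_i(r, a, lam) - hess_i(r, a, lam) * r for r, a in zip(ref, coeffs))
    H = sum(hess_i(r, a, lam) for r, a in zip(ref, coeffs))
    for k in range(steps):
        i = k % m  # cyclic choice: only one component is re-evaluated per step
        a = coeffs[i]
        b -= grad_i(ref[i], a, lam) - hess_i(ref[i], a, lam) * ref[i]
        H -= hess_i(ref[i], a, lam)
        ref[i] = theta
        b += grad_i(theta, a, lam) - hess_i(theta, a, lam) * theta
        H += hess_i(theta, a, lam)
        theta -= step * (b + H * theta)
    return theta
```

Calling `ciag_1d([1.0, -0.5, 2.0, 1.5, -1.0], lam=1.0, steps=200)` returns a point at which `full_gradient` is numerically zero, even though each step touches only one component.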


Analysis of Fast Structured Dictionary Learning

arXiv.org Machine Learning

Sparsity-based models and techniques have been exploited in many signal processing and imaging applications. Data-driven methods based on dictionary and transform learning enable learning rich image features from data, and can outperform analytical models. In particular, alternating optimization algorithms for dictionary learning have been popular. In this work, we focus on alternating minimization for a specific structured unitary operator learning problem, and provide a convergence analysis. While the algorithm generally converges to the critical points of the problem, our analysis establishes, under mild assumptions, the local linear convergence of the algorithm to the underlying generating model of the data. Analysis and numerical simulations show that our assumptions hold well for standard probabilistic data models. In practice, the algorithm is robust to initialization.


Distributed Stochastic Gradient Tracking Methods

arXiv.org Machine Learning

In this paper, we study the problem of distributed multi-agent optimization over a network, where each agent possesses a local cost function that is smooth and strongly convex. The global objective is to find a common solution that minimizes the average of all cost functions. Assuming agents only have access to unbiased estimates of the gradients of their local cost functions, we consider a distributed stochastic gradient tracking method (DSGT) and a gossip-like stochastic gradient tracking method (GSGT). We show that, in expectation, the iterates generated by each agent are attracted to a neighborhood of the optimal solution, where they accumulate exponentially fast (under a constant stepsize choice). Under DSGT, the limiting (expected) error bounds on the distance of the iterates from the optimal solution decrease with the network size $n$, which is comparable to the performance of a centralized stochastic gradient algorithm. Moreover, we show that when the network is well-connected, GSGT incurs lower communication cost than DSGT while maintaining a similar computational cost. A numerical example further demonstrates the effectiveness of the proposed methods.
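
A gradient tracking iteration of the kind described above can be sketched as follows. This is a toy under stated assumptions, not the paper's experiment: scalar local costs $f_i(x) = \frac{1}{2}(x - c_i)^2$, a ring network with mixing weights $1/2$ (self) and $1/4$ (each neighbor), Gaussian gradient noise, and the hypothetical function name `dsgt_ring`. Each agent mixes its neighbors' iterates and also tracks the network-average gradient through the auxiliary variable `y`.

```python
import random


def dsgt_ring(centers, step, iters, noise_std=0.0, seed=0):
    """Toy distributed stochastic gradient tracking on a ring (assumed setup).

    Agent i holds f_i(x) = 0.5 * (x - centers[i])**2, so the minimizer of the
    average cost is the mean of `centers`.
    """
    rng = random.Random(seed)
    n = len(centers)

    def stoch_grad(i, x):
        # Unbiased gradient estimate: true gradient plus zero-mean noise.
        return (x - centers[i]) + rng.gauss(0.0, noise_std)

    def mix(values):
        # Doubly stochastic ring averaging: 1/2 self, 1/4 each neighbor.
        return [
            0.5 * values[i] + 0.25 * values[(i - 1) % n] + 0.25 * values[(i + 1) % n]
            for i in range(n)
        ]

    x = [0.0] * n
    g_prev = [stoch_grad(i, x[i]) for i in range(n)]
    y = list(g_prev)  # tracker starts at the local gradient estimates
    for _ in range(iters):
        # Consensus step on the locally updated iterates.
        x = mix([x[i] - step * y[i] for i in range(n)])
        g_new = [stoch_grad(i, x[i]) for i in range(n)]
        mixed_y = mix(y)
        # Tracking step: mixed tracker plus the change in local gradient estimates.
        y = [mixed_y[i] + g_new[i] - g_prev[i] for i in range(n)]
        g_prev = g_new
    return x
```

With `noise_std=0.0` every agent converges to the mean of `centers`; with a positive `noise_std` and a constant `step` the iterates hover in a neighborhood of it, which matches the qualitative behavior the abstract describes.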


A Flexible Multi-Objective Bayesian Optimization Approach using Random Scalarizations

arXiv.org Machine Learning

Many real-world applications can be framed as multi-objective optimization problems, where we wish to simultaneously optimize for multiple criteria. Bayesian optimization techniques for the multi-objective setting are pertinent when the evaluation of the functions in question is expensive. Traditional methods for multi-objective optimization, both Bayesian and otherwise, are aimed at recovering the Pareto front of these objectives. However, we argue that recovering the entire Pareto front may not be aligned with our goals in practice. For example, while a practitioner might desire to identify Pareto optimal points, she may wish to focus only on a particular region of the Pareto front due to external considerations. In this work we propose an approach based on random scalarizations of the objectives. We demonstrate that our approach can focus its sampling on certain regions of the Pareto front while being flexible enough to sample from the entire Pareto front if required. Furthermore, our approach is less computationally demanding than existing approaches. In this paper, we also analyse a notion of regret in the multi-objective setting and obtain sublinear regret bounds. We compare the proposed approach to other state-of-the-art approaches on both synthetic and real-life experiments. The results demonstrate superior performance of our proposed algorithm in terms of flexibility, scalability and regret.
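
The role of the weight distribution in a random scalarization can be shown with a minimal sketch. It deliberately leaves out the Bayesian surrogate model and acquisition function. It assumes a finite set of candidates with known objective values (to be maximized), a linear scalarization, and weights drawn from a box and then normalized. All names (`sample_weights`, `pick_by_random_scalarization`, `is_pareto_optimal`) are hypothetical, and the paper may use other scalarizations such as a Tchebyshev form.

```python
import random


def sample_weights(rng, low, high):
    # Draw one positive weight per objective from [low_k, high_k], then normalize.
    raw = [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
    total = sum(raw)
    return [r / total for r in raw]


def pick_by_random_scalarization(values, rng, low, high):
    """Return the index of the candidate maximizing a randomly weighted sum."""
    w = sample_weights(rng, low, high)
    scores = [sum(wk * yk for wk, yk in zip(w, y)) for y in values]
    return max(range(len(values)), key=scores.__getitem__)


def is_pareto_optimal(idx, values):
    # A point is Pareto optimal if no other point is at least as good in every
    # objective and strictly better in at least one.
    y = values[idx]
    for other in values:
        at_least_as_good = all(o >= v for o, v in zip(other, y))
        strictly_better = any(o > v for o, v in zip(other, y))
        if at_least_as_good and strictly_better:
            return False
    return True


# Hypothetical objective values for five candidates (two objectives each).
CANDIDATES = [(1.0, 0.0), (0.8, 0.6), (0.6, 0.8), (0.0, 1.0), (0.3, 0.3)]
```

Sampling weights from a wide box such as `low=[0.05, 0.05], high=[1.0, 1.0]` spreads the picks over the Pareto front, while a narrow box such as `low=[0.8, 0.05], high=[1.0, 0.2]` concentrates them on the region that favors the first objective. Because the weights are strictly positive, every pick is Pareto optimal among the candidates.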


Differential Properties of Sinkhorn Approximation for Learning with Wasserstein Distance

arXiv.org Machine Learning

Applications of optimal transport have recently gained remarkable attention thanks to the computational advantages of entropic regularization. However, in most situations the Sinkhorn approximation of the Wasserstein distance is replaced by a regularized version that is less accurate but easy to differentiate. In this work we characterize the differential properties of the original Sinkhorn distance, proving that it enjoys the same smoothness as its regularized version and we explicitly provide an efficient algorithm to compute its gradient. We show that this result benefits both theory and applications: on one hand, high order smoothness confers statistical guarantees to learning with Wasserstein approximations. On the other hand, the gradient formula allows us to efficiently solve learning and optimization problems in practice. Promising preliminary experiments complement our analysis.
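
For readers unfamiliar with the Sinkhorn iteration, the sketch below computes an entropically regularized transport plan and the transport cost evaluated at that plan. It assumes that this cost is the quantity the abstract calls the original Sinkhorn distance. The gradient algorithm from the paper is not reproduced here, and the function names `sinkhorn_plan` and `transport_cost` are hypothetical.

```python
import math


def sinkhorn_plan(a, b, cost, eps, iters=500):
    """Entropic-regularized transport plan via Sinkhorn scaling iterations.

    a, b : source and target histograms (positive entries, each summing to 1)
    cost : cost matrix as a list of rows, cost[i][j] >= 0
    eps  : regularization strength (larger values give a smoother plan)
    """
    n, m = len(a), len(b)
    # Gibbs kernel associated with the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    # Inner product <plan, cost>, without any entropy term added.
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

As a usage example, take three points at positions 0, 0.5 and 1 with squared-distance cost, `a = [0.5, 0.3, 0.2]`, `b = [0.2, 0.3, 0.5]` and `eps = 0.5`: the returned plan has row sums matching `a` and column sums matching `b` up to numerical tolerance, and `transport_cost` gives the approximation of the Wasserstein cost at that regularization level.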


Model-Driven Artificial Intelligence for Online Network Optimization

arXiv.org Artificial Intelligence

Future 5G wireless networks will rely on agile and automated network management, where the usage of diverse resources must be jointly optimized with surgical accuracy. A number of key wireless network functionalities (e.g., traffic steering, energy savings) give rise to hard optimization problems. What is more, high spatio-temporal traffic variability coupled with the need to satisfy strict per-slice/service SLAs in modern networks suggests that these problems must be constantly (re-)solved to maintain close-to-optimal performance. To this end, in this paper we propose the framework of Online Network Optimization (ONO), which seeks to maintain both agile and efficient control over time, using an arsenal of data-driven, adaptive, and AI-based techniques. Since the mathematical tools and the studied regimes vary widely among these methodologies, a theoretical comparison is often out of reach. Therefore, the important question "what is the right ONO technique?" remains open to date. In this paper, we discuss the pros and cons of each technique and further attempt a direct quantitative comparison for a specific use case, using real data. Our results suggest that carefully combining the insights of problem modeling with state-of-the-art AI techniques provides significant advantages at reasonable complexity.


Generic CP-Supported CMSA for Binary Integer Linear Programs

arXiv.org Artificial Intelligence

Construct, Merge, Solve & Adapt (CMSA) [6] is a hybrid metaheuristic that can be applied to any combinatorial optimization problem for which a way of generating feasible solutions is known, and whose subproblems can be solved to optimality by a black-box solver. Moreover, CMSA is intended for those problem instances for which applying the standalone black-box solver is not feasible due to the problem instance size and/or difficulty. The main idea of CMSA is to generate reduced sub-instances of the original problem instances, based on feasible solutions that are constructed at each iteration, and to solve these reduced sub-instances by means of the black-box solver. The parameters of CMSA have to be adjusted so that the reduced sub-instances are small enough for the black-box solver to solve them efficiently. CMSA has been applied to several NP-hard combinatorial optimization problems, including minimum common string partition [6, 4], the repetition-free longest common subsequence problem [5], and the multidimensional knapsack problem [15]. A possible disadvantage of CMSA is the fact that a problem-specific way of probabilistically generating solutions is used in the above-mentioned applications. Therefore, the goal of this paper is to design a CMSA variant that can be easily applied to different combinatorial optimization problems. One way of achieving this goal is the development of a solver for a quite general problem.