Optimization
A Hypergradient Approach to Robust Regression without Correspondence
Xie, Yujia, Mao, Yixiu, Zuo, Simiao, Xu, Hongteng, Ye, Xiaojing, Zhao, Tuo, Zha, Hongyuan
We consider a regression problem, where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real world problems. Taking flow cytometry as an example, the measuring instruments are unable to preserve the correspondence between the samples and the measurements. Due to the combinatorial nature, most of existing methods are only applicable when the sample size is small, and limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework - ROBOT- for the shuffled regression problem, which is applicable to large data and complex models. Specifically, we propose to formulate the regression without correspondence as a continuous optimization problem. Then by exploiting the interaction between the regression model and the data correspondence, we propose to develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT is quite general, and can be further extended to the inexact correspondence setting, where the input and output data are not necessarily exactly aligned. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
Doubly Stochastic Subspace Clustering
Lim, Derek, Vidal, René, Haeffele, Benjamin D.
Many state-of-the-art subspace clustering methods follow a two-step process by first constructing an affinity matrix between data points and then applying spectral clustering to this affinity. Most of the research into these methods focuses on the first step of generating the affinity matrix, which often exploits the self-expressive property of linear subspaces, with little consideration typically given to the spectral clustering step that produces the final clustering. Moreover, existing methods obtain the affinity by applying ad-hoc postprocessing steps to the self-expressive representation of the data, and this postprocessing can have a significant impact on the subsequent spectral clustering step. In this work, we propose to unify these two steps by jointly learning both a self-expressive representation of the data and an affinity matrix that is well-normalized for spectral clustering. In the proposed model, we constrain the affinity matrix to be doubly stochastic, which results in a principled method for affinity matrix normalization while also exploiting the known benefits of doubly stochastic normalization in spectral clustering. While our proposed model is non-convex, we give a convex relaxation that is provably equivalent in many regimes; we also develop an efficient approximation to the full model that works well in practice. Experiments show that our method achieves state-of-the-art subspace clustering performance on many common datasets in computer vision.
Persistent Reductions in Regularized Loss Minimization for Variable Selection
In the context of regularized loss minimization with polyhedral gauges, we show that for a broad class of loss functions (possibly non-smooth and non-convex) and under a simple geometric condition on the input data it is possible to efficiently identify a subset of features which are guaranteed to have zero coefficients in all optimal solutions in all problems with loss functions from said class, before any iterative optimization has been performed for the original problem. This procedure is standalone, takes only the data as input, and does not require any calls to the loss function. Therefore, we term this procedure as a persistent reduction for the aforementioned class of regularized loss minimization problems. This reduction can be efficiently implemented via an extreme ray identification subroutine applied to a polyhedral cone formed from the datapoints. We employ an existing output-sensitive algorithm for extreme ray identification which makes our guarantee and algorithm applicable in ultra-high dimensional problems.
Optimal Algorithms for Convex Nested Stochastic Composite Optimization
Recently, convex nested stochastic composite optimization (NSCO) has received considerable attention for its application in reinforcement learning and risk-averse optimization. However, In the current literature, there exists a significant gap in the iteration complexities between these NSCO problems and other simpler stochastic composite optimization problems (e.g., sum of smooth and nonsmooth functions) without the nested structure. In this paper, we close the gap by reformulating a class of convex NSCO problems as "$\min\max\ldots \max$" saddle point problems under mild assumptions and proposing two primal-dual type algorithms with the optimal $\mathcal{O}\{1/\epsilon^2\}$ (resp., $\mathcal{O}\{1/\epsilon\}$) complexity for solving nested (resp., strongly) convex problems. More specifically, for the often-considered two-layer smooth-nonsmooth problem, we introduce a simple vanilla stochastic sequential dual (SSD) algorithm which can be implemented purely in the primal form. For the multi-layer problem, we propose a general stochastic sequential dual framework. The framework consists of modular dual updates for different types of functions (smooth, smoothable, and non-smooth, etc.), so that it can handle a more general composition of layer functions. Moreover, we present modular convergence proofs to show that the complexity of the general SSD is optimal with respect to nearly all the problem parameters.
Distributed Variational Bayesian Algorithms Over Sensor Networks
Distributed inference/estimation in Bayesian framework in the context of sensor networks has recently received much attention due to its broad applicability. The variational Bayesian (VB) algorithm is a technique for approximating intractable integrals arising in Bayesian inference. In this paper, we propose two novel distributed VB algorithms for general Bayesian inference problem, which can be applied to a very general class of conjugate-exponential models. In the first approach, the global natural parameters at each node are optimized using a stochastic natural gradient that utilizes the Riemannian geometry of the approximation space, followed by an information diffusion step for cooperation with the neighbors. In the second method, a constrained optimization formulation for distributed estimation is established in natural parameter space and solved by alternating direction method of multipliers (ADMM). An application of the distributed inference/estimation of a Bayesian Gaussian mixture model is then presented, to evaluate the effectiveness of the proposed algorithms. Simulations on both synthetic and real datasets demonstrate that the proposed algorithms have excellent performance, which are almost as good as the corresponding centralized VB algorithm relying on all data available in a fusion center.
Memoization in Python: The Essence of Dynamic Programming
Dynamic programming is a method developed by Richard Bellman in 1950s. The main idea behind the dynamic programming is to break a complicated problem into smaller sub-problems in a recursive manner. In computer science and programming, the dynamic programming method is used to solve some optimization problems. The dynamic programming is a general concept and not special to a particular programming language. But, we will do the examples in Python.
Sparse Signal Reconstruction for Nonlinear Models via Piecewise Rational Optimization
Marmin, Arthur, Castella, Marc, Pesquet, Jean-Christophe, Duval, Laurent
We propose a method to reconstruct sparse signals degraded by a nonlinear distortion and acquired at a limited sampling rate. Our method formulates the reconstruction problem as a nonconvex minimization of the sum of a data fitting term and a penalization term. In contrast with most previous works which settle for approximated local solutions, we seek for a global solution to the obtained challenging nonconvex problem. Our global approach relies on the so-called Lasserre relaxation of polynomial optimization. We here specifically include in our approach the case of piecewise rational functions, which makes it possible to address a wide class of nonconvex exact and continuous relaxations of the $\ell_0$ penalization function. Additionally, we study the complexity of the optimization problem. It is shown how to use the structure of the problem to lighten the computational burden efficiently. Finally, numerical simulations illustrate the benefits of our method in terms of both global optimality and signal reconstruction.
Autonomous Graph Mining Algorithm Search with Best Speed/Accuracy Trade-off
Yoon, Minji, Gervet, Théophile, Hooi, Bryan, Faloutsos, Christos
Graph data is ubiquitous in academia and industry, from social networks to bioinformatics. The pervasiveness of graphs today has raised the demand for algorithms that can answer various questions: Which products would a user like to purchase given her order list? Which users are buying fake followers to increase their public reputation? Myriads of new graph mining algorithms are proposed every year to answer such questions - each with a distinct problem formulation, computational time, and memory footprint. This lack of unity makes it difficult for a practitioner to compare different algorithms and pick the most suitable one for a specific application. These challenges - even more severe for non-experts - create a gap in which state-of-the-art techniques developed in academic settings fail to be optimally deployed in real-world applications. To bridge this gap, we propose AUTOGM, an automated system for graph mining algorithm development. We first define a unified framework UNIFIEDGM that integrates various message-passing based graph algorithms, ranging from conventional algorithms like PageRank to graph neural networks. Then UNIFIEDGM defines a search space in which five parameters are required to determine a graph algorithm. Under this search space, AUTOGM explicitly optimizes for the optimal parameter set of UNIFIEDGM using Bayesian Optimization. AUTOGM defines a novel budget-aware objective function for the optimization to incorporate a practical issue - finding the best speed-accuracy trade-off under a computation budget - into the graph algorithm generation problem. Experiments on real-world benchmark datasets demonstrate that AUTOGM generates novel graph mining algorithms with the best speed/accuracy trade-off compared to existing models with heuristic parameters.
Cable Tree Wiring -- Benchmarking Solvers on a Real-World Scheduling Problem with a Variety of Precedence Constraints
Koehler, Jana, Bürgler, Joseph, Fontana, Urs, Fux, Etienne, Herzog, Florian, Pouly, Marc, Saller, Sophia, Salyaeva, Anastasia, Scheiblechner, Peter, Waelti, Kai
Cable trees are widely used in industrial products to transmit energy and information between different product parts. For example, cable trees are needed in cars to automate many previously mechanical functions such as moving seats or opening windows and to add new functions such as a voice-controlled navigation or an onboard entertainment system. It is thus not surprising that for example a car like the VW Golf 7 contains 14 cable trees with a total of 1633 cables. The manufacturing of cable trees usually relies on cheap manual labour performed in low-cost countries where humans plug cables into harnesses following a wiring plan. Only few automated manufacturing solutions exist, which rely on complex robotic machines. These machines execute a sequence of wiring operations that highly qualified technicians develop by analyzing the wiring plan. With the continuing tendency towards customer-specific and resource-efficient justin-time manufacturing, smaller batch sizes of cable trees need to be manufactured requiring a frequent change of wiring plans, for which wiring sequences should be derived instantly. Scaling up human expertise to such frequent changes is simply impossible, which explains a growing interest in the intelligent automated manufacturing of cable trees. This interest is also nourished by a further miniaturization of cable harnesses, which will make their manual manufacturing impossible.
Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraints
Tonin, Francesco, Patrinos, Panagiotis, Suykens, Johan A. K.
We introduce Constr-DRKM, a deep kernel method for the unsupervised learning of disentangled data representations. We propose augmenting the original deep restricted kernel machine formulation for kernel PCA by orthogonality constraints on the latent variables to promote disentanglement and to make it possible to carry out optimization without first defining a stabilized objective. After illustrating an end-to-end training procedure based on a quadratic penalty optimization algorithm with warm start, we quantitatively evaluate the proposed method's effectiveness in disentangled feature learning. We demonstrate on four benchmark datasets that this approach performs similarly overall to $\beta$-VAE on a number of disentanglement metrics when few training points are available, while being less sensitive to randomness and hyperparameter selection than $\beta$-VAE. We also present a deterministic initialization of Constr-DRKM's training algorithm that significantly improves the reproducibility of the results. Finally, we empirically evaluate and discuss the role of the number of layers in the proposed methodology, examining the influence of each principal component in every layer and showing that components in lower layers act as local feature detectors capturing the broad trends of the data distribution, while components in deeper layers use the representation learned by previous layers and more accurately reproduce higher-level features.