Goto

Collaborating Authors

 Optimization


A Tight Bound of Hard Thresholding

arXiv.org Machine Learning

This paper is concerned with the hard thresholding technique which sets all but the $k$ largest absolute elements to zero. We establish a tight bound that quantitatively characterizes the deviation of the thresholded solution from a given signal. Our theoretical result is universal in the sense that it holds for all choices of parameters, and the underlying analysis only depends on fundamental arguments in mathematical optimization. We discuss the implications for the literature: Compressed Sensing. On account of the crucial estimate, we bridge the connection between restricted isometry property (RIP) and the sparsity parameter of $k$ for a vast volume of hard thresholding based algorithms, which renders an improvement on the RIP condition especially when the true sparsity is unknown. This suggests that in essence, many more kinds of sensing matrices or fewer measurements are admissible for the data acquisition procedure. Machine Learning. In terms of large-scale machine learning, a significant yet challenging problem is producing sparse solutions in online setting. In stark contrast to prior works that attempted the $\ell_1$ relaxation for promoting sparsity, we present a novel algorithm which performs hard thresholding in each iteration to ensure such parsimonious solutions. Equipped with the developed bound for hard thresholding, we prove global linear convergence for a number of prevalent statistical models under mild assumptions, even though the problem turns out to be non-convex.


DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization

arXiv.org Machine Learning

Machine learning with big data often involves large optimization models. For distributed optimization over a cluster of machines, frequent communication and synchronization of all model parameters (optimization variables) can be very costly. A promising solution is to use parameter servers to store different subsets of the model parameters, and update them asynchronously at different machines using local datasets. In this paper, we focus on distributed optimization of large linear models with convex loss functions, and propose a family of randomized primal-dual block coordinate algorithms that are especially suitable for asynchronous distributed implementation with parameter servers. In particular, we work with the saddle-point formulation of such problems which allows simultaneous data and model partitioning, and exploit its structure by doubly stochastic coordinate optimization with variance reduction (DSCOVR). Compared with other first-order distributed algorithms, we show that DSCOVR may require less amount of overall computation and communication, and less or no synchronization. We discuss the implementation details of the DSCOVR algorithms, and present numerical experiments on an industrial distributed computing system.


Subjectively Interesting Subgroup Discovery on Real-valued Targets

arXiv.org Machine Learning

Deriving insights from high-dimensional data is one of the core problems in data mining. The difficulty mainly stems from the fact that there are exponentially many variable combinations to potentially consider, and there are infinitely many if we consider weighted combinations, even for linear combinations. Hence, an obvious question is whether we can automate the search for interesting patterns and visualizations. In this paper, we consider the setting where a user wants to learn as efficiently as possible about real-valued attributes. For example, to understand the distribution of crime rates in different geographic areas in terms of other (numerical, ordinal and/or categorical) variables that describe the areas. We introduce a method to find subgroups in the data that are maximally informative (in the formal Information Theoretic sense) with respect to a single or set of real-valued target attributes. The subgroup descriptions are in terms of a succinct set of arbitrarily-typed other attributes. The approach is based on the Subjective Interestingness framework FORSIED to enable the use of prior knowledge when finding most informative non-redundant patterns, and hence the method also supports iterative data mining.


Iterative proportional scaling revisited: a modern optimization perspective

arXiv.org Machine Learning

Count data are ubiquitous in modern statistical applications. Such data are often cross-classified into contingency tables, where iterative proportional scaling (IPS) can be applied as a standard tool (Fienberg and Meyer, 2006). IPS was firstly introduced by Deming and Stephan (1940) to adjust a contingency table to obey prescribed column and row marginals, the problem of which is referred to as matrix raking nowadays. In general, IPS can be applied to Kullback-Leibler (KL) divergence minimization with linear constraints, and Poisson log-linear model fitting on multi-way tables (Ireland and Kullback, 1968; Bishop et al., 1975), and there is an interesting 1 duality between these two types of problems (Good, 1963; Csiszรกr, 1975). Theoretical studies regarding the convergence properties of IPS undergo a long history and we refer to Pukelsheim (2014) for a comprehensive survey.


Local Convergence of Proximal Splitting Methods for Rank Constrained Problems

arXiv.org Machine Learning

We analyze the local convergence of proximal splitting algorithms to solve optimization problems that are convex besides a rank constraint. For this, we show conditions under which the proximal operator of a function involving the rank constraint is locally identical to the proximal operator of its convex envelope, hence implying local convergence. The conditions imply that the non-convex algorithms locally converge to a solution whenever a convex relaxation involving the convex envelope can be expected to solve the non-convex problem.


Is it possible to train a neural network without backpropagation?

@machinelearnbot

The first two algorithms you mention (Nelder-Mead and Simulated Annealing) are generally considered pretty much obsolete in optimization circles, as there are much better alternatives which are both more reliable and less costly. Genetic algorithms covers a wide range, and some of these can be reasonable. However, in the broader class of derivative-free optimization algorithms, there are many which are significantly better than these "classics", as this has been an active area of research in recent decades. So, might some of these newer approaches be reasonable for deep learning? This is a nice paper which has many interesting insights into recent techniques.


Stochastic Runtime Analysis of a Cross Entropy Algorithm for Traveling Salesman Problems

arXiv.org Artificial Intelligence

This article analyzes the stochastic runtime of a Cross-Entropy Algorithm on two classes of traveling salesman problems. The algorithm shares main features of the famous Max-Min Ant System with iteration-best reinforcement. For simple instances that have a $\{1,n\}$-valued distance function and a unique optimal solution, we prove a stochastic runtime of $O(n^{6+\epsilon})$ with the vertex-based random solution generation, and a stochastic runtime of $O(n^{3+\epsilon}\ln n)$ with the edge-based random solution generation for an arbitrary $\epsilon\in (0,1)$. These runtimes are very close to the known expected runtime for variants of Max-Min Ant System with best-so-far reinforcement. They are obtained for the stronger notion of stochastic runtime, which means that an optimal solution is obtained in that time with an overwhelming probability, i.e., a probability tending exponentially fast to one with growing problem size. We also inspect more complex instances with $n$ vertices positioned on an $m\times m$ grid. When the $n$ vertices span a convex polygon, we obtain a stochastic runtime of $O(n^{3}m^{5+\epsilon})$ with the vertex-based random solution generation, and a stochastic runtime of $O(n^{2}m^{5+\epsilon})$ for the edge-based random solution generation. When there are $k = O(1)$ many vertices inside a convex polygon spanned by the other $n-k$ vertices, we obtain a stochastic runtime of $O(n^{4}m^{5+\epsilon}+n^{6k-1}m^{\epsilon})$ with the vertex-based random solution generation, and a stochastic runtime of $O(n^{3}m^{5+\epsilon}+n^{3k}m^{\epsilon})$ with the edge-based random solution generation. These runtimes are better than the expected runtime for the so-called $(\mu\!+\!\lambda)$ EA reported in a recent article, and again obtained for the stronger notion of stochastic runtime.


A Tutorial on Hawkes Processes for Events in Social Media

arXiv.org Machine Learning

This chapter provides an accessible introduction for point processes, and especially Hawkes processes, for modeling discrete, inter-dependent events over continuous time. We start by reviewing the definitions and the key concepts in point processes. We then introduce the Hawkes process, its event intensity function, as well as schemes for event simulation and parameter estimation. We also describe a practical example drawn from social media data - we show how to model retweet cascades using a Hawkes self-exciting process. We presents a design of the memory kernel, and results on estimating parameters and predicting popularity. The code and sample event data are available as an online appendix


DSOS and SDSOS Optimization: More Tractable Alternatives to Sum of Squares and Semidefinite Optimization

arXiv.org Machine Learning

In recent years, optimization theory has been greatly impacted by the advent of sum of squares (SOS) optimization. The reliance of this technique on large-scale semidefinite programs however, has limited the scale of problems to which it can be applied. In this paper, we introduce DSOS and SDSOS optimization as more tractable alternatives to sum of squares optimization that rely instead on linear and second order cone programs respectively. These are optimization problems over certain subsets of sum of squares polynomials (or equivalently subsets of positive semidefinite matrices), which can be of interest in general applications of semidefinite programming where scalability is a limitation. We show that some basic theorems from SOS optimization which rely on results from real algebraic geometry are still valid for DSOS and SDSOS optimization. Furthermore, we show with numerical experiments from diverse application areas---polynomial optimization, statistics and machine learning, derivative pricing, and control theory---that with reasonable tradeoffs in accuracy, we can handle problems at scales that are currently far beyond the reach of sum of squares approaches. Finally, we provide a review of recent techniques that bridge the gap between our DSOS/SDSOS approach and the SOS approach at the expense of additional running time. The appendix of the paper introduces an accompanying MATLAB package for DSOS and SDSOS optimization.


Enhancing Transparency of Black-box Soft-margin SVM by Integrating Data-based Prior Information

arXiv.org Machine Learning

Development of black-box modeling techniques, like support vector machine (SVM), neural networks, etc., has shown rather rapid in the past decades (Yuan et al., 2016; Zhao et al., 2015; Wu et al., 2013). This sort of techniques, compared to white-box modeling methods (also called mechanism-based modeling or first-principles modeling), works without any need of knowing the internal structure or details on variables interaction in systems considered, so they are suited to describe extremely complex objectives, such as human brain (Khosrowabadi et al., 2014), black hole (Grumiller et al., 2012), integrated industrial processes (Gao et al., 2012) and so on. Essentially, blackbox modeling is an input-output data-based approach, and the model precision mainly depends on data quality, model structure and parameters identification algorithm. In order to develop high-precision black-box models, it always needs reliable and representative data, smart mathematical treatment and efficient identification algorithms. All of these are challenging the development of the black-box modeling techniques.