Goto

Collaborating Authors

 Optimization


DAGs with NO TEARS: Smooth Optimization for Structure Learning

arXiv.org Machine Learning

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint and are not well-suited to general purpose optimization packages for their solution. In this paper, we introduce a fundamentally different strategy: We formulate the structure learning problem as a smooth, constrained optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting nonconvex, constrained program involves smooth functions whose gradients are easy to compute and only involve elementary matrix operations. By using existing black-box optimization routines, our method uses global search to find an optimal DAG and can be implemented in about 50 lines of Python and outperforms existing methods without imposing any structural constraints.


A Distributed Quasi-Newton Algorithm for Empirical Risk Minimization with Nonsmooth Regularization

arXiv.org Machine Learning

In this paper, we propose a communication- and computation- efficient distributed optimization algorithm using second- order information for solving ERM problems with a nonsmooth regularization term. Current second-order and quasi- Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations, and we describe how to maintain an approximation of the Hessian and solve subproblems efficiently in a distributed manner. The proposed method enjoys global linear convergence for a broad range of non-strongly convex problems that includes the most commonly used ERMs, thus requiring lower communication complexity. Empirical results also demonstrate that our method significantly improves on communication cost and running time over the current state-of-the-art methods.


Gauged Mini-Bucket Elimination for Approximate Inference

arXiv.org Machine Learning

Computing the partition function $Z$ of a discrete graphical model is a fundamental inference challenge. Since this is computationally intractable, variational approximations are often used in practice. Recently, so-called gauge transformations were used to improve variational lower bounds on $Z$. In this paper, we propose a new gauge-variational approach, termed WMBE-G, which combines gauge transformations with the weighted mini-bucket elimination (WMBE) method. WMBE-G can provide both upper and lower bounds on $Z$, and is easier to optimize than the prior gauge-variational algorithm. We show that WMBE-G strictly improves the earlier WMBE approximation for symmetric models including Ising models with no magnetic field. Our experimental results demonstrate the effectiveness of WMBE-G even for generic, nonsymmetric models.


Stable Robbins-Monro approximations through stochastic proximal updates

arXiv.org Machine Learning

The need for parameter estimation with massive data has reinvigorated interest in iterative estimation procedures. Stochastic approximations, such as stochastic gradient descent, are at the forefront of this recent development because they yield simple, generic, and extremely fast iterative estimation procedures. Such stochastic approximations, however, are often numerically unstable. As a consequence, current practice has turned to proximal operators, which can induce stable parameter updates within iterations. While the majority of classical iterative estimation procedures are subsumed by the framework of Robbins and Monro (1951), there is no such generalization for stochastic approximations with proximal updates. In this paper, we conceptualize a general stochastic approximation method with proximal updates. This method can be applied even in situations where the analytical form of the objective is not known, and so it generalizes many stochastic gradient procedures with proximal operators currently in use. Our theoretical analysis indicates that the proposed method has important stability benefits over the classical stochastic approximation method. Exact instantiations of the proposed method are challenging, but we show that approximate instantiations lead to procedures that are easy to implement, and still dominate classical procedures by achieving numerical stability without tradeoffs. This last advantage is akin to that seen in deterministic proximal optimization, where the framework is typically impossible to instantiate exactly, but where approximate instantiations lead to new optimization procedures that dominate classical ones.


Nonnegative Matrix Factorization for Signal and Data Analytics: Identifiability, Algorithms, and Applications

arXiv.org Machine Learning

Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding to its model identifiability---the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging---had been rather limited until recent years. Beginning from the 2010s, the identifiability research of NMF has progressed considerably: Many interesting and important results have been discovered by the signal processing (SP) and machine learning (ML) communities. NMF identifiability has a great impact on many aspects in practice, such as ill-posed formulation avoidance and performance-guaranteed algorithm design. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim at filling this gap by offering a comprehensive and deep tutorial on model identifiability of NMF as well as the connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical `pitfalls' that are often times due to unidentifiable NMF formulations. This paper will also help practitioners pick/design suitable factorization tools for their own problems.


Fast and Sample Efficient Inductive Matrix Completion via Multi-Phase Procrustes Flow

arXiv.org Machine Learning

Matrix completion method has been used in a wide range of applications such as collaborative filtering for recommendation (Koren et al., 2009), multi-label learning (Cabral et al., 2011) and clustering (Hsieh et al., 2012). In these applications, every entry is modeled as the inner product between factors corresponding to the row and column variables. For example, in movie recommendation, each row factor represents the latent representation of a user and each column factor represents the latent representation of a movie. In many applications of significant interest, besides the partially observed matrix, side information, in the form of features, is also available. These might correspond to demographic information (genders, occupation) for users or product information (genre, director) in a movie recommender system for example. With such features at hand, one can model an observation as a specific linear interaction between features to reduce the model complexity.


Computational Optimal Transport

arXiv.org Machine Learning

Optimal Transport (OT) is a mathematical gem at the interface between probability, analysis and optimization. The goal of that theory is to define geometric tools that are useful to compare probability distributions. Earlier contributions originated from Monge's work in the 18th century, to be later rediscovered under a different formalism by Tolstoi in the 1920's, Kantorovich, Hitchcock and Koopmans in the 1940's. The problem was solved numerically by Dantzig in 1949 and others in the 1950's within the framework of linear programming, paving the way for major industrial applications in the second half of the 20th century. OT was later rediscovered under a different light by analysts in the 90's, following important work by Brenier and others, as well as in the computer vision/graphics fields under the name of earth mover's distances. Recent years have witnessed yet another revolution in the spread of OT, thanks to the emergence of approximate solvers that can scale to sizes and dimensions that are relevant to data sciences. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation) or machine learning (for regression,classification and density fitting). This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds lights on the theoretical properties of OT that make it particularly useful for some of these applications.


Re-examination of Bregman functions and new properties of their divergences

arXiv.org Machine Learning

The Bregman divergence (Bregman distance, Bregman measure of distance) is a certain useful substitute for a distance, obtained from a well-chosen function (the "Bregman function"). Bregman functions and divergences have been extensively investigated during the last decades and have found applications in optimization, operations research, information theory, nonlinear analysis, machine learning and more. This paper reexamines various aspects related to the theory of Bregman functions and divergences. In particular, it presents many sufficient conditions which allow the construction of Bregman functions in a general setting and introduces new Bregman functions (such as a negative iterated log entropy). Moreover, it sheds new light on several known Bregman functions such as quadratic entropies, the negative Havrda-Charvรกt-Tsallis entropy, and the negative Boltzmann-Gibbs-Shannon entropy, and it shows that the negative Burg entropy, which is not a Bregman function according to the classical theory but nevertheless is known to have "Bregmanian properties", can, by our reexamination of the theory, be considered as a Bregman function. Our analysis yields several byproducts of independent interest such as the introduction of the concept of relative uniform convexity (a certain generalization of uniform convexity), new properties of uniformly and strongly convex functions, and results in Banach space theory.


On Polynomial Time PAC Reinforcement Learning with Rich Observations

arXiv.org Machine Learning

We study episodic reinforcement learning (RL) when the observations may be realistically rich, such as images or text. We aim for methods that use function approximation in a provably effective manner to find the best possible policy through systematic exploration. While such problems are central to empirical RL research [22], most theoretical results on systematic exploration have focused on tabular MDPs with small state spaces [e.g., 19]. Until recently, little was known about how to engage in sophisticated exploration in the general function approximation setting to achieve global optimality in a statistically efficient manner. Indeed, as pointed out by Krishnamurthy et al. [20], no algorithm achieving polynomial sample complexity is possible without further assumptions. Nevertheless, when the underlying problem exhibits additional structure, it was recently shown that learning becomes statistically feasible. In particular, Krishnamurthy et al. [20] showed that reactive POMDPs with rich observations and deterministic dynamics over M hidden states can be learned with polynomial sample complexity that depends on M. Later, Jiang et al. [16] provided a new algorithm called O LIVE that In this paper, we directly address this difficult computational challenge. We adopt a reduction approach, meaning that we aim to design algorithms whose computation can be reduced to common optimization oracles over function spaces, such as linear optimization and cost-sensitive classification, while retaining the statistical properties of prior works.


Learning Flexible and Reusable Locomotion Primitives for a Microrobot

arXiv.org Machine Learning

The design of gaits for robot locomotion can be a daunting process which requires significant expert knowledge and engineering. This process is even more challenging for robots that do not have an accurate physical model, such as compliant or micro-scale robots. Data-driven gait optimization provides an automated alternative to analytical gait design. In this paper, we propose a novel approach to efficiently learn a wide range of locomotion tasks with walking robots. This approach formalizes locomotion as a contextual policy search task to collect data, and subsequently uses that data to learn multi-objective locomotion primitives that can be used for planning. As a proof-of-concept we consider a simulated hexapod modeled after a recently developed microrobot, and we thoroughly evaluate the performance of this microrobot on different tasks and gaits. Our results validate the proposed controller and learning scheme on single and multi-objective locomotion tasks. Moreover, the experimental simulations show that without any prior knowledge about the robot used (e.g., dynamics model), our approach is capable of learning locomotion primitives within 250 trials and subsequently using them to successfully navigate through a maze.