Goto

Collaborating Authors

 Optimization


Reset-Free Lifelong Learning with Skill-Space Planning

arXiv.org Artificial Intelligence

The objective of lifelong reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.


Adaptive Sequential SAA for Solving Two-stage Stochastic Linear Programs

arXiv.org Machine Learning

We present adaptive sequential SAA (sample average approximation) algorithms to solve large-scale two-stage stochastic linear programs. The iterative algorithm framework we propose is organized into \emph{outer} and \emph{inner} iterations as follows: during each outer iteration, a sample-path problem is implicitly generated using a sample of observations or ``scenarios," and solved only \emph{imprecisely}, to within a tolerance that is chosen \emph{adaptively}, by balancing the estimated statistical error against solution error. The solutions from prior iterations serve as \emph{warm starts} to aid efficient solution of the (piecewise linear convex) sample-path optimization problems generated on subsequent iterations. The generated scenarios can be independent and identically distributed (iid), or dependent, as in Monte Carlo generation using Latin-hypercube sampling, antithetic variates, or randomized quasi-Monte Carlo. We first characterize the almost-sure convergence (and convergence in mean) of the optimality gap and the distance of the generated stochastic iterates to the true solution set. We then characterize the corresponding iteration complexity and work complexity rates as a function of the sample size schedule, demonstrating that the best achievable work complexity rate is Monte Carlo canonical and analogous to the generic $\mathcal{O}(\epsilon^{-2})$ optimal complexity for non-smooth convex optimization. We report extensive numerical tests that indicate favorable performance, due primarily to the use of a sequential framework with an optimal sample size schedule, and the use of warm starts. The proposed algorithm can be stopped in finite-time to return a solution endowed with a probabilistic guarantee on quality.


Convergence of block coordinate descent with diminishing radius for nonconvex optimization

arXiv.org Machine Learning

Block coordinate descent (BCD), also known as nonlinear Gauss-Seidel, is a simple iterative algorithm for nonconvex optimization that sequentially minimizes the objective function in each block coordinate while the other coordinates are held fixed. It is known that block-wise convexity of the objective is not enough to guarantee convergence of BCD to the stationary points and some additional regularity condition is needed. In this work, we provide a simple modification of BCD that has guaranteed global convergence to the stationary points for block-wise convex objective function without additional conditions. Our idea is to restrict the parameter search within a diminishing radius to promote stability of iterates, and then to show that such auxiliary constraint vanishes in the limit. As an application, we provide a modified alternating least squares algorithm for nonnegative CP tensor factorization that is guaranteed to converge to the stationary points of reconstruction error function. We also provide some experimental validation of our result.


Sobolev Wasserstein GAN

arXiv.org Machine Learning

Wasserstein GANs (WGANs), built upon the Kantorovich-Rubinstein (KR) duality of Wasserstein distance, is one of the most theoretically sound GAN models. However, in practice it does not always outperform other variants of GANs. This is mostly due to the imperfect implementation of the Lipschitz condition required by the KR duality. Extensive work has been done in the community with different implementations of the Lipschitz constraint, which, however, is still hard to satisfy the restriction perfectly in practice. In this paper, we argue that the strong Lipschitz constraint might be unnecessary for optimization. Instead, we take a step back and try to relax the Lipschitz constraint. Theoretically, we first demonstrate a more general dual form of the Wasserstein distance called the Sobolev duality, which relaxes the Lipschitz constraint but still maintains the favorable gradient property of the Wasserstein distance. Moreover, we show that the KR duality is actually a special case of the Sobolev duality. Based on the relaxed duality, we further propose a generalized WGAN training scheme named Sobolev Wasserstein GAN (SWGAN), and empirically demonstrate the improvement of SWGAN over existing methods with extensive experiments.


Recent Developments in Boolean Matrix Factorization

arXiv.org Artificial Intelligence

Boolean matrix factorization (BMF) is a variant of the standard matrix factorization problem in the Boolean semiring: given a binary matrix, the task is to find two smaller binary matrices so that their product, taken over the Boolean semiring, is as close to the original matrix as possible. Because the matrix product is not done over a field but over a semiring, many standard matrix factorization techniques fail to work. Indeed, finding the best Boolean factorization is computationally hard. The computational hardness of the problem has not prevented people from studying it. In psychometrics, some of the first algorithms appeared in the 1980's (see Bělohlávek and Trnecka (2018)). Even before that, mathematicians studying combinatorics had studied the "Boolean linear algebra" (Kim, 1982; Monson et al., 1995).


When Do Curricula Work?

arXiv.org Machine Learning

Inspired by human learning, researchers have proposed ordering examples during training based on their difficulty. Both curriculum learning, exposing a network to easier examples early in training, and anti-curriculum learning, showing the most difficult examples first, have been suggested as improvements to the standard i.i.d. training. In this work, we set out to investigate the relative benefits of ordered learning. We first investigate the \emph{implicit curricula} resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of \emph{explicit curricula}, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered. We find that for standard benchmark datasets, curricula have only marginal benefits, and that randomly ordered samples perform as well or better than curricula and anti-curricula, suggesting that any benefit is entirely due to the dynamic training set size. Inspired by common use cases of curriculum learning in practice, we investigate the role of limited training time budget and noisy data in the success of curriculum learning. Our experiments demonstrate that curriculum, but not anti-curriculum can indeed improve the performance either with limited training time budget or in existence of noisy data.


Efficient semidefinite-programming-based inference for binary and multi-class MRFs

arXiv.org Machine Learning

Probabilistic inference in pairwise Markov Random Fields (MRFs), i.e. computing the partition function or computing a MAP estimate of the variables, is a foundational problem in probabilistic graphical models. Semidefinite programming relaxations have long been a theoretically powerful tool for analyzing properties of probabilistic inference, but have not been practical owing to the high computational cost of typical solvers for solving the resulting SDPs. In this paper, we propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF by instead exploiting a recently proposed coordinate-descent-based fast semidefinite solver. We also extend semidefinite relaxations from the typical binary MRF to the full multi-class setting, and develop a compact semidefinite relaxation that can again be solved efficiently using the solver. We show that the method substantially outperforms (both in terms of solution quality and speed) the existing state of the art in approximate inference, on benchmark problems drawn from previous work. We also show that our approach can scale to large MRF domains such as fully-connected pairwise CRF models used in computer vision.


Divide and Learn: A Divide and Conquer Approach for Predict+Optimize

arXiv.org Artificial Intelligence

Divide and Learn: A Divide and Conquer Approach for Predict Optimize Authors Ali Ugur Guler, 1 Emir Demirovic, 2 Jeffrey Chan, 3 James Bailey, 1 Christopher Leckie, 1 Peter J. Stuckey, 4 1 University of Melbourne, 2 Delft University of Technology, 3 RMIT University, 4 Monash University aguler@student.unimelb.edu.au, Abstract The predict optimize problem combines machine learning of problem coefficients with a combinatorial optimization problem that uses the predicted coefficients. While this problem can be solved in two separate stages, it is better to directly minimize the optimization loss. However, this requires differentiating through a discrete, non-differentiable combinatorial function. Most existing approaches use some form of surrogate gradient. Demirovic et al showed how to directly express the loss of the optimization problem in terms of the predicted coefficients as a piece-wise linear function. However, their approach is restricted to optimization problems with a dynamic programming formulation. In this work we propose a novel divide and conquer algorithm to tackle optimization problems without this restriction and predict its coefficients using the optimization loss. We also introduce a greedy version of this approach, which achieves similar results with less computation. We compare our approach with other approaches to the predict optimize problem and show we can successfully tackle some hard combinatorial problems better than other predict optimize methods. Introduction Machine Learning ( ML) has gained substantial attention in the last decade, and has proven to be useful in a wide range of industries. ML models usually focus on making accurate predictions by minimizing errors, such as mean squared error ( MSE). These predictions can then be used as coefficients in other decision making processes, such as a combinatorial optimization problem.


Cognitive Capabilities for the CAAI in Cyber-Physical Production Systems

arXiv.org Artificial Intelligence

This paper presents the cognitive module of the cognitive architecture for artificial intelligence (CAAI) in cyber-physical production systems (CPPS). The goal of this architecture is to reduce the implementation effort of artificial intelligence (AI) algorithms in CPPS. Declarative user goals and the provided algorithm-knowledge base allow the dynamic pipeline orchestration and configuration. A big data platform (BDP) instantiates the pipelines and monitors the CPPS performance for further evaluation through the cognitive module. Thus, the cognitive module is able to select feasible and robust configurations for process pipelines in varying use cases. Furthermore, it automatically adapts the models and algorithms based on model quality and resource consumption. The cognitive module also instantiates additional pipelines to test algorithms from different classes. CAAI relies on well-defined interfaces to enable the integration of additional modules and reduce implementation effort. Finally, an implementation based on Docker, Kubernetes, and Kafka for the virtualization and orchestration of the individual modules and as messaging-technology for module communication is used to evaluate a real-world use case.


A similarity-based Bayesian mixture-of-experts model

arXiv.org Machine Learning

We present a new nonparametric mixture-of-experts model for multivariate regression problems, inspired by the probabilistic $k$-nearest neighbors algorithm. Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point, yielding predictive distributions represented by Gaussian mixtures. Posterior inference is performed on the parameters of the mixture components as well as the distance metric using a mean-field variational Bayes algorithm accompanied with a stochastic gradient-based optimization procedure. The proposed method is especially advantageous in settings where inputs are of relatively high dimension in comparison to the data size, where input--output relationships are complex, and where predictive distributions may be skewed or multimodal. Computational studies on two synthetic datasets and one dataset comprising dose statistics of radiation therapy treatment plans show that our mixture-of-experts method outperforms a Gaussian process benchmark model both in terms of validation metrics and visual inspection.