Goto

Collaborating Authors

 Uncertainty


Asynchronous Stochastic Variational Inference

arXiv.org Machine Learning

Stochastic variational inference (SVI) employs stochastic optimization to scale up Bayesian computation to massive data. Since SVI is at its core a stochastic gradient-based algorithm, horizontal parallelism can be harnessed to allow larger scale inference. We propose a lock-free parallel implementation for SVI which allows distributed computations over multiple slaves in an asynchronous style. We show that our implementation leads to linear speed-up while guaranteeing an asymptotic ergodic convergence rate $O(1/\sqrt(T)$ ) given that the number of slaves is bounded by $\sqrt(T)$ ($T$ is the total number of iterations). The implementation is done in a high-performance computing (HPC) environment using message passing interface (MPI) for python (MPI4py). The extensive empirical evaluation shows that our parallel SVI is lossless, performing comparably well to its counterpart serial SVI with linear speed-up.


Stochastic Maximum Likelihood Optimization via Hypernetworks

arXiv.org Machine Learning

This work explores maximum likelihood optimization of neural networks through hypernetworks. A hypernetwork initializes the weights of another network, which in turn can be employed for typical functional tasks such as regression and classification. We optimize hypernetworks to directly maximize the conditional likelihood of target variables given input. Using this approach we obtain competitive empirical results on regression and classification benchmarks.


GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

arXiv.org Machine Learning

Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fr\'echet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP) outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.


Masked Autoregressive Flow for Density Estimation

arXiv.org Machine Learning

Autoregressive models are among the best performing neural density estimators. We describe an approach for increasing the flexibility of an autoregressive model, based on modelling the random numbers that the model uses internally when generating data. By constructing a stack of autoregressive models, each modelling the random numbers of the next model in the stack, we obtain a type of normalizing flow suitable for density estimation, which we call Masked Autoregressive Flow. This type of flow is closely related to Inverse Autoregressive Flow and is a generalization of Real NVP. Masked Autoregressive Flow achieves state-of-the-art performance in a range of general-purpose density estimation tasks.


Active Community Detection: A Maximum Likelihood Approach

arXiv.org Machine Learning

We propose novel semi-supervised and active learning algorithms for the problem of community detection on networks. The algorithms are based on optimizing the likelihood function of the community assignments given a graph and an estimate of the statistical model that generated it. The optimization framework is inspired by prior work on the unsupervised community detection problem in Stochastic Block Models (SBM) using Semi-Definite Programming (SDP). In this paper we provide the next steps in the evolution of learning communities in this context which involves a constrained semi-definite programming algorithm, and a newly presented active learning algorithm. The active learner intelligently queries nodes that are expected to maximize the change in the model likelihood. Experimental results show that this active learning algorithm outperforms the random-selection semi-supervised version of the same algorithm as well as other state-of-the-art active learning algorithms. Our algorithms significantly improved performance is demonstrated on both real-world and SBM-generated networks even when the SBM has a signal to noise ratio (SNR) below the known unsupervised detectability threshold.


Multivariate Bayesian Structural Time Series Model

arXiv.org Machine Learning

This paper deals with inference and prediction for multiple correlated time series, where one has also the choice of using a candidate pool of contemporaneous predictors for each target series. Starting with a structural model for the time-series, Bayesian tools are used for model fitting, prediction, and feature selection, thus extending some recent work along these lines for the univariate case. The Bayesian paradigm in this multivariate setting helps the model avoid overfitting as well as capture correlations among the multiple time series with the various state components. The model provides needed flexibility to choose a different set of components and available predictors for each target series. The cyclical component in the model can handle large variations in the short term, which may be caused by external shocks. We run extensive simulations to investigate properties such as estimation accuracy and performance in forecasting. We then run an empirical study with one-step-ahead prediction on the max log return of a portfolio of stocks that involve four leading financial institutions. Both the simulation studies and the extensive empirical study confirm that this multivariate model outperforms three other benchmark models, viz. a model that treats each target series as independent, the autoregressive integrated moving average model with regression (ARIMAX), and the multivariate ARIMAX (MARIMAX) model.


Sales forecasting and risk management under uncertainty in the media industry

arXiv.org Machine Learning

In this work we propose a data-driven modelization approach for the management of advertising investments of a firm. First, we propose an application of dynamic linear models to the prediction of an economic variable, such as global sales, which can use information from the environment and the investment levels of the company in different channels. After we build a robust and precise model, we propose a metric of risk, which can help the firm to manage their advertisement plans, thus leading to a robust, risk-aware optimization of their revenue. The advertising industry represents an estimate of US$ 529.43 billion [4] and this quantity is likely to increase in the following years. In parallel, there has been a recent interest in the application of data-driven models in the context of forecasting and decision making in the marketing industry.


Scalable high-resolution forecasting of sparse spatiotemporal events with kernel methods: a winning solution to the NIJ "Real-Time Crime Forecasting Challenge"

arXiv.org Machine Learning

This article describes Team Kernel Glitches' solution to the National Institute of Justice's (NIJ) Real-Time Crime Forecasting Challenge. The goal of the NIJ Real-Time Crime Forecasting Competition was to maximize two different crime hotspot scoring metrics for calls-for-service to the Portland Police Bureau (PPB) in Portland, Oregon during the period from March 1, 2017 to May 31, 2017. Our solution to the challenge is a spatiotemporal forecasting model combining scalable randomized Reproducing Kernel Hilbert Space (RKHS) methods for approximating Gaussian processes with autoregressive smoothing kernels in a regularized supervised learning framework. Our model can be understood as an approximation to the popular log-Gaussian Cox Process model: we discretize the spatiotemporal point pattern and learn a log intensity function using the Poisson likelihood and highly efficient gradient-based optimization methods. Model hyperparameters including quality of RKHS approximation, spatial and temporal kernel lengthscales, number of autoregressive lags, bandwidths for smoothing kernels, as well as cell shape, size, and rotation, were learned using crossvalidation. Resulting predictions exceeded baseline KDE estimates by 0.157. Performance improvement over baseline predictions were particularly large for sparse crimes over short forecasting horizons.


Log-concave sampling: Metropolis-Hastings algorithms are fast!

arXiv.org Machine Learning

We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by running a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step to ensure the correct stationary distribution. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} \big(\kappa d \log(1/\delta) \big)$ steps, as compared to the $\mathcal{O} \big(\kappa^2 d/\delta^2 \big)$ steps established in past work on ULA. We also demonstrate the gains of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for a zeroth-order method Metropolized random walk (MRW) and show that it mixes $\mathcal{O}(\kappa d)$ slower than MALA. We provide numerical examples that support our theoretical findings, and demonstrate the potential gains of Metropolis-Hastings adjustment for Langevin-type algorithms.


Bayesian Methods for Machine Learning Coursera

@machinelearnbot

About this course: Bayesian methods are used in lots of fields: from game development to drug discovery. They give superpowers to many machine learning algorithms: handling missing data, extracting much more information from small datasets. Bayesian methods also allow us to estimate uncertainty in predictions, which is a really desirable feature for fields like medicine. When Bayesian methods are applied to deep learning, it turns out that they allow you to compress your models 100 folds, and automatically tune hyperparametrs, saving your time and money. In six weeks we will discuss the basics of Bayesian methods: from how to define a probabilistic model to how to make predictions from it.