Undirected Networks
A Probabilistic Framework for Nonlinearities in Stochastic Neural Networks
Su, Qinliang, Liao, Xuejun, Carin, Lawrence
We present a probabilistic framework for nonlinearities, based on doubly truncated Gaussian distributions. By setting the truncation points appropriately, we are able to generate various types of nonlinearities within a unified framework, including sigmoid, tanh and ReLU, the most commonly used nonlinearities in neural networks. The framework readily integrates into existing stochastic neural networks (with hidden units characterized as random variables), allowing one for the first time to learn the nonlinearities alongside model weights in these networks. Extensive experiments demonstrate the performance improvements brought about by the proposed framework when integrated with the restricted Boltzmann machine (RBM), temporal RBM and the truncated Gaussian graphical model (TGGM).
Measuring Sample Quality with Kernels
Gorham, Jackson, Mackey, Lester
Approximate Markov chain Monte Carlo (MCMC) offers the promise of more rapid sampling at the cost of more biased inference. Since standard MCMC diagnostics fail to detect these biases, researchers have developed computable Stein discrepancy measures that provably determine the convergence of a sample to its target distribution. This approach was recently combined with the theory of reproducing kernels to define a closed-form kernel Stein discrepancy (KSD) computable by summing kernel evaluations across pairs of sample points. We develop a theory of weak convergence for KSDs based on Stein's method, demonstrate that commonly used KSDs fail to detect non-convergence even for Gaussian targets, and show that kernels with slowly decaying tails provably determine convergence for a large class of target distributions. The resulting convergence-determining KSDs are suitable for comparing biased, exact, and deterministic sample sequences and simpler to compute and parallelize than alternative Stein discrepancies. We use our tools to compare biased samplers, select sampler hyperparameters, and improve upon existing KSD approaches to one-sample hypothesis testing and sample quality improvement.
Switching nonparametric regression models for multi-curve data
de Souza, Camila P. E., Heckman, Nancy E., Xu, Helena
We develop and apply an approach for analyzing multi-curve data where each curve is driven by a latent state process. The state at any particular point determines a smooth function, forcing the individual curve to switch from one function to another. Thus each curve follows what we call a switching nonparametric regression model. We develop an EM algorithm to estimate the model parameters. We also obtain standard errors for the parameter estimates of the state process. We consider several types of state processes: independent and identically distributed, independent but depending on a covariate and Markov. Simulation studies show the frequentist properties of our estimates. We apply our methods to a data set of a building's power usage.
A Brief Introduction to Machine Learning for Engineers
Department of Informatics, King's College London; osvaldo.simeone@kcl.ac.uk ABSTRACT This monograph aims at providing an introduction to key concepts, algorithms, and theoretical frameworks in machine learning, including supervised and unsupervised learning, statistical learning theory, probabilistic graphical models and approximate inference. The intended readership consists of electrical engineers with a background in probability and linear algebra. The treatment builds on first principles, and organizes the main ideas according to clearly defined categories, such as discriminative and generative models, frequentist and Bayesian approaches, exact and approximate inference, directed and undirected models, and convex and non-convex optimization. The mathematical framework uses information-theoretic measures as a unifying tool. The text offers simple and reproducible numerical examples providing insights into key motivations and conclusions. Rather than providing exhaustive details on the existing myriad solutions in each specific category, for which the reader is referred to textbooks and papers, this monograph is meant as an entry point for an engineer into the literature on machine learning.
Phase transitions in Restricted Boltzmann Machines with generic priors
Barra, Adriano, Genovese, Giuseppe, Sollich, Peter, Tantari, Daniele
We present a complete analysis of the replica symmetric phase diagram of these systems, which can be regarded as Generalised Hopfield models. We underline the role of the retrieval phase for both inference and learning processes and we show that retrieval is robust for a large class of weight and unit priors, beyond the standard Hopfield scenario. Furthermore we show how the paramagnetic phase boundary is directly related to the optimal size of the training set necessary for good generalisation in a teacher-student scenario of unsupervised learning. In recent years supervised machine learning with neural networks has found renewed interest from the practical success of so-called deep networks in solving several difficult problems, ranging from image classification to speech recognition and video segmentation [1]. Despite this remarkable progress, unsupervised learning with neural networks, in which the structure of data is learned without a priori knowledge of a specific task, still lacks a solid theoretical scaffold. Such learning of hidden features of complex data in high dimensional spaces by fitting a generative probabilistic model is used for de-noising, completion and data generation, but also as a dimensionality reduction pre-training step in supervised methods [7, 8].
"I can assure you [$\ldots$] that it's going to be all right" -- A definition, case for, and survey of algorithmic assurances in human-autonomy trust relationships
In essence, people who interact with advanced technology want to be able to trust it appropriately, and then act on that trust. In interpersonal relationships, and otherwise, humans act largely based on trust. For example, a supervisor asks a subordinate to accomplish a task based on several factors that indicate they can trust them to accomplish that task. When consumers make purchases, they do so with trust that the product will perform as promised. Likewise, when using something like an autonomous vehicle, the user must be able to trust it appropriately in order to use it properly. With the rapid advancement of the capabilities of intelligent computing technology to do tasks that were previously assumed to be too complicated for computers, there has been much recent discussion regarding how humans can trust this technology - although the connection to trust is not always made explicit, per se.
A State-Space Approach to Dynamic Nonnegative Matrix Factorization
Mohammadiha, Nasser, Smaragdis, Paris, Panahandeh, Ghazaleh, Doclo, Simon
Nonnegative matrix factorization (NMF) has been actively investigated and used in a wide range of problems in the past decade. A significant amount of attention has been given to develop NMF algorithms that are suitable to model time series with strong temporal dependencies. In this paper, we propose a novel state-space approach to perform dynamic NMF (D-NMF). In the proposed probabilistic framework, the NMF coefficients act as the state variables and their dynamics are modeled using a multi-lag nonnegative vector autoregressive (N-VAR) model within the process equation. We use expectation maximization and propose a maximum-likelihood estimation framework to estimate the basis matrix and the N-VAR model parameters. Interestingly, the N-VAR model parameters are obtained by simply applying NMF. Moreover, we derive a maximum a posteriori estimate of the state variables (i.e., the NMF coefficients) that is based on a prediction step and an update step, similarly to the Kalman filter. We illustrate the benefits of the proposed approach using different numerical simulations where D-NMF significantly outperforms its static counterpart. Experimental results for three different applications show that the proposed approach outperforms two state-of-the-art NMF approaches that exploit temporal dependencies, namely a nonnegative hidden Markov model and a frame stacking approach, while it requires less memory and computational power.
Asymptotic Bias of Stochastic Gradient Search
Tadic, Vladislav B., Doucet, Arnaud
The asymptotic behavior of the stochastic gradient algorithm with a biased gradient estimator is analyzed. Relying on arguments based on the dynamic system theory (chain-recurrence) and the differential geometry (Yomdin theorem and Lojasiewicz inequality), tight bounds on the asymptotic bias of the iterates generated by such an algorithm are derived. The obtained results hold under mild conditions and cover a broad class of high-dimensional nonlinear algorithms. Using these results, the asymptotic properties of the policy-gradient (reinforcement) learning and adaptive population Monte Carlo sampling are studied. Relying on the same results, the asymptotic behavior of the recursive maximum split-likelihood estimation in hidden Markov models is analyzed, too.
Quantifying the accuracy of approximate diffusions and Markov chains
Huggins, Jonathan H., Zou, James
Markov chains and diffusion processes are indispensable tools in machine learning and statistics that are used for inference, sampling, and modeling. With the growth of large-scale datasets, the computational cost associated with simulating these stochastic processes can be considerable, and many algorithms have been proposed to approximate the underlying Markov chain or diffusion. A fundamental question is how the computational savings trade off against the statistical error incurred due to approximations. This paper develops general results that address this question. We bound the Wasserstein distance between the equilibrium distributions of two diffusions as a function of their mixing rates and the deviation in their drifts. We show that this error bound is tight in simple Gaussian settings. Our general result on continuous diffusions can be discretized to provide insights into the computational-statistical trade-off of Markov chains. As an illustration, we apply our framework to derive finite-sample error bounds of approximate unadjusted Langevin dynamics. We characterize computation-constrained settings where, by using fast-to-compute approximate gradients in the Langevin dynamics, we obtain more accurate samples compared to using the exact gradients. Finally, as an additional application of our approach, we quantify the accuracy of approximate zig-zag sampling. Our theoretical analyses are supported by simulation experiments.