 Bayesian Learning


Extrapolating Expected Accuracies for Large Multi-Class Problems

arXiv.org Machine Learning

Many machine learning tasks involve recognizing or identifying an individual instance within a large set of possible candidates. These problems are usually modeled as multi-class classification problems with a large and possibly complex label set. Leading examples include identifying a speaker from voice patterns (Togneri and Pullella, 2011), identifying an author from written text (Stamatatos et al., 2014), and labeling the object category of an image (Duygulu et al., 2002, Deng et al., 2010, Oquab et al., 2014). In all these examples, the algorithm observes an input x and uses the classifier function h to guess the label y from a large label set S. There are multiple practical challenges in developing classifiers for large label sets. Collecting high-quality training data is perhaps the main obstacle, as the costs scale with the number of classes. It can be affordable to first collect data for a small set of classes, even if the long-term goal is to generalize to a larger set. Furthermore, classifier development can be accelerated by training first on fewer classes, as each training cycle may require substantially fewer resources. Indeed, because of the interest in how small-set performance generalizes to larger sets, such comparisons can be found in the literature (Oquab et al., 2014, Griffin et al., 2007). A natural question is: how does changing the size of the label set affect the classification accuracy?


Robust Loss Functions under Label Noise for Deep Neural Networks

arXiv.org Machine Learning

In many applications of classifier learning, training data suffers from label noise. Deep networks are learned using huge training data where the problem of noisy labels is particularly relevant. The current techniques proposed for learning deep networks under label noise focus on modifying the network architecture and on algorithms for estimating true labels from noisy labels. An alternate approach would be to look for loss functions that are inherently noise-tolerant. For binary classification there exist theoretical results on loss functions that are robust to label noise. In this paper, we provide some sufficient conditions on a loss function so that risk minimization under that loss function would be inherently tolerant to label noise for multiclass classification problems. These results generalize the existing results on noise-tolerant loss functions for binary classification. We study some of the widely used loss functions in deep networks and show that the loss function based on mean absolute value of error is inherently robust to label noise. Thus standard back propagation is enough to learn the true classifier even under label noise. Through experiments, we illustrate the robustness of risk minimization with such loss functions for learning neural networks.
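The abstract does not spell out its sufficient conditions. A minimal sketch follows, assuming the condition of interest is a symmetry property: the loss summed over all possible labels is the same constant for every prediction. Under that assumption, the code checks numerically that the mean-absolute-error loss on softmax outputs has this property while categorical cross-entropy does not. The function names and the choice of K = 5 classes are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mae_loss(p, label):
    """L1 distance between the one-hot label and the predicted probabilities."""
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    return np.abs(onehot - p).sum()

def cce_loss(p, label):
    """Categorical cross-entropy for a single example."""
    return -np.log(p[label])

def loss_sum_over_labels(loss, p):
    """Sum of the loss over every possible label for one prediction p."""
    return sum(loss(p, k) for k in range(len(p)))

rng = np.random.default_rng(0)
K = 5  # number of classes (illustrative)
mae_sums, cce_sums = [], []
for _ in range(3):
    p = softmax(rng.normal(size=K))
    mae_sums.append(loss_sum_over_labels(mae_loss, p))
    cce_sums.append(loss_sum_over_labels(cce_loss, p))

print("MAE sums over labels:", mae_sums)  # each equals 2 * (K - 1)
print("CCE sums over labels:", cce_sums)  # varies with the prediction
```

For a probability vector p, the MAE loss against label j is 2 - 2 p_j, so the sum over all K labels is 2(K - 1) regardless of p. Cross-entropy has no such constant sum.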


On Connecting Stochastic Gradient MCMC and Differential Privacy

arXiv.org Machine Learning

Significant success has been realized recently in applying machine learning to real-world applications. There have also been corresponding concerns about the privacy of training data, which relate to data security and confidentiality issues. Differential privacy provides a principled and rigorous privacy guarantee for machine learning models. While it is common to design a model satisfying a required differential-privacy property by injecting noise, it is generally hard to balance the trade-off between privacy and utility. We show that stochastic gradient Markov chain Monte Carlo (SG-MCMC), a class of scalable Bayesian posterior sampling algorithms proposed recently, satisfies strong differential privacy with carefully chosen step sizes. We develop theory on the performance of the proposed differentially-private SG-MCMC method. We conduct experiments to support our analysis and show that a standard SG-MCMC sampler without any modification (under a default setting) can reach state-of-the-art performance in terms of both privacy and utility on Bayesian learning.
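To make the sampler family concrete, here is a minimal sketch of stochastic gradient Langevin dynamics (SGLD), one well-known SG-MCMC algorithm, on a toy problem: inferring the mean of a unit-variance Gaussian under a Gaussian prior. The model, the function name `sgld_gaussian_mean`, and all hyper-parameter values are assumptions chosen for illustration. The sketch does not compute any differential-privacy guarantee; it only shows where the step size and the injected Gaussian noise enter the update.

```python
import numpy as np

def sgld_gaussian_mean(data, n_steps=5000, batch_size=32,
                       step_size=1e-4, prior_var=100.0, seed=0):
    """SGLD samples of theta for x_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    theta = 0.0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        # Gradient of the log prior.
        grad_prior = -theta / prior_var
        # Minibatch estimate of the full-data log-likelihood gradient.
        grad_lik = (n / batch_size) * np.sum(batch - theta)
        # Langevin step: half-step along the gradient plus N(0, step_size) noise.
        noise = np.sqrt(step_size) * rng.normal()
        theta += 0.5 * step_size * (grad_prior + grad_lik) + noise
        samples[t] = theta
    return samples

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=1000)
samples = sgld_gaussian_mean(data)

# Closed-form posterior mean for this conjugate model, for comparison.
posterior_mean = data.sum() / (len(data) + 1.0 / 100.0)
print("SGLD mean after burn-in:", samples[1000:].mean())
print("exact posterior mean:   ", posterior_mean)
```

The noise variance is tied to the step size, which is why the choice of step size matters both for sampling accuracy and, per the abstract, for the privacy analysis.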


Bayesian Computational Analyses with R Udemy

@machinelearnbot

Bayesian Computational Analyses with R is an introductory course on the use and implementation of Bayesian modeling using R software. The Bayesian approach is an alternative to the "frequentist" approach where one simply takes a sample of data and makes inferences about the likely parameters of the population. In contrast, the Bayesian approach uses both likelihood functions and a sample of observed data (the 'prior') to estimate the most likely values and distributions for the estimated population parameters (the 'posterior'). The course is useful to anyone who wishes to learn about Bayesian concepts and is suited to both novice and intermediate Bayesian students and Bayesian practitioners. It is a practical, "hands-on" course with many examples using R scripts and software, and it is also conceptual, as the course explains the Bayesian concepts. All materials, software, R scripts, slides, exercises and solutions are included with the course materials.


On Statistical Optimality of Variational Bayes

arXiv.org Machine Learning

Variational inference [25, 7, 40] is now a well-established tool to approximate intractable posterior distributions in hierarchical multi-layered Bayesian models. The traditional Markov chain Monte Carlo (MCMC; [17]) approach of approximating distributions with intractable normalizing constants draws (correlated) samples according to a discrete-time Markov chain whose stationary distribution is the target distribution. Despite their success and popularity, MCMC methods can be slow to converge and lack scalability in big data problems and/or problems involving very many latent variables, which has fueled the search for alternatives. In contrast to the sampling approach of MCMC, variational inference approaches the problem from an optimization viewpoint. First, a class of analytically tractable distributions, referred to as the variational family, is identified for the problem at hand. For example, in mean-field approximation, the set of parameters and latent variables is divided into blocks and the variational distribution is assumed to be independent across blocks.
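In symbols, the mean-field construction described above can be written as follows. The notation is assumed here for illustration (unknowns \theta split into J blocks, data x) and is not taken from the paper; the objective shown is the standard Kullback-Leibler formulation of variational inference.

```latex
% Mean-field family: the variational distribution factorizes across blocks.
q(\theta) = \prod_{j=1}^{J} q_j(\theta_j)

% Variational inference picks the member of the family closest to the posterior:
\hat{q} = \operatorname*{arg\,min}_{q \in \mathcal{Q}}
          \mathrm{KL}\bigl( q(\theta) \,\|\, p(\theta \mid x) \bigr)

% Equivalently, it maximizes the evidence lower bound (ELBO), which avoids
% the intractable normalizing constant p(x):
\mathrm{ELBO}(q) = \mathbb{E}_{q}\bigl[ \log p(x, \theta) \bigr]
                 - \mathbb{E}_{q}\bigl[ \log q(\theta) \bigr]
                 = \log p(x) - \mathrm{KL}\bigl( q(\theta) \,\|\, p(\theta \mid x) \bigr)
```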


Estimating the Probability of Meeting a Deadline in Hierarchical Plans

arXiv.org Artificial Intelligence

Given a hierarchical plan (or schedule) with uncertain task times, we propose a deterministic polynomial (time and memory) algorithm for estimating the probability that it meets a deadline, or, alternately, that its makespan is less than a given duration. Approximation is needed as it is known that this problem is NP-hard even for sequential plans (just a sum of random variables). In addition, we show two new complexity results: (1) Counting the number of events that do not cross the deadline is #P-hard; (2) Computing the expected makespan of a hierarchical plan is NP-hard. For the proposed approximation algorithm, we establish formal approximation bounds and show that the time and memory complexities grow polynomially with the required accuracy, the number of nodes in the plan, and with the size of the support of the random variables that represent the durations of the primitive tasks. We examine these approximation bounds empirically and demonstrate, using task networks taken from the literature, how our scheme outperforms sampling techniques and exact computation in terms of accuracy and run-time. As the empirical data shows much better error bounds than guaranteed, we also suggest a method for tightening the bounds in some cases.
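For intuition about the quantity being estimated, the sketch below computes the deadline probability exactly for a tiny plan. This is the brute-force exact computation, not the approximation algorithm the abstract proposes. It assumes independent task durations with small discrete supports, represents each distribution as a dict from duration to probability, and treats "meeting the deadline" as makespan less than or equal to the deadline. All names and the example plan are hypothetical.

```python
from collections import defaultdict
from functools import reduce
from itertools import product

def combine(dist_a, dist_b, op):
    """Distribution of op(A, B) for independent discrete A and B."""
    out = defaultdict(float)
    for (a, pa), (b, pb) in product(dist_a.items(), dist_b.items()):
        out[op(a, b)] += pa * pb
    return dict(out)

def sequence(*dists):
    """Makespan of tasks run one after another: sum of durations."""
    return reduce(lambda x, y: combine(x, y, lambda a, b: a + b), dists)

def parallel(*dists):
    """Makespan of tasks run concurrently: maximum of durations."""
    return reduce(lambda x, y: combine(x, y, max), dists)

def prob_meets_deadline(dist, deadline):
    """P(makespan <= deadline)."""
    return sum(p for duration, p in dist.items() if duration <= deadline)

# Hypothetical plan: task A, followed by tasks B and C executed in parallel.
task_a = {1: 0.5, 3: 0.5}
task_b = {2: 0.5, 4: 0.5}
task_c = {2: 0.5, 5: 0.5}

makespan = sequence(task_a, parallel(task_b, task_c))
prob = prob_meets_deadline(makespan, deadline=6)
print(sorted(makespan.items()))
print(prob)  # 0.625
```

The support of the combined distribution can grow multiplicatively with each combination step, which illustrates why exact computation becomes impractical for larger plans.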


An Approximate Bayesian Long Short-Term Memory Algorithm for Outlier Detection

arXiv.org Machine Learning

Abstract: Long Short-Term Memory networks trained with gradient descent and back-propagation have achieved great success in various applications. However, point estimation of the weights of the networks is prone to over-fitting problems and lacks important uncertainty information associated with the estimation. Exact Bayesian neural network methods, on the other hand, are intractable and not applicable to real-world applications. In this study, we propose an approximate estimation of the weights uncertainty using the Ensemble Kalman Filter, which is easily scalable to a large number of weights. To assess the proposed algorithm, we apply it to outlier detection in five real-world events retrieved from the Twitter platform. Introduction: The recent resurgence of neural networks trained with back-propagation has established state-of-the-art results in a wide range of domains. However, backpropagation-based neural networks (NN) are associated with many disadvantages, including but not limited to the lack of uncertainty estimation, a tendency to overfit small datasets, and the need to tune many hyper-parameters.


Truncated Variational Expectation Maximization

arXiv.org Machine Learning

We derive a novel variational expectation maximization approach based on truncated variational distributions. Truncated distributions are proportional to exact posteriors within a subset of a discrete state space and equal zero otherwise. The novel variational approach is realized by first generalizing the standard variational EM framework to include variational distributions with exact ('hard') zeros. A fully variational treatment of truncated distributions then allows for deriving novel and mathematically grounded results, which in turn can be used to formulate novel efficient algorithms to optimize the parameters of probabilistic generative models. We find the free energies which correspond to truncated distributions to be given by concise and efficiently computable expressions, while update equations for model parameters (M-steps) remain in their standard form. Furthermore, we obtain generic expressions for expectation values w.r.t. truncated distributions. Based on these observations, we show how efficient and easily applicable meta-algorithms can be formulated that guarantee a monotonic increase of the free energy. Example applications of the framework derived here provide novel theoretical results and learning procedures for latent variable models as well as mixture models, including procedures to tightly couple sampling and variational optimization approaches. Furthermore, by considering a special case of truncated variational distributions, we can cleanly and fully embed the well-known 'hard EM' approaches into the variational EM framework, and we show that 'hard EM' (for models with discrete latents) provably optimizes a lower free energy bound of the data log-likelihood.
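The verbal definition of a truncated distribution given above can be written as a formula. The symbols are assumed here for illustration and may differ from the paper's notation: s is a discrete latent state, y an observation, \Theta the model parameters, and \mathcal{K} the retained subset of the state space.

```latex
% Truncated variational distribution: proportional to the exact posterior
% inside the subset K of the discrete state space, exactly zero outside it.
q(s) =
\begin{cases}
  \dfrac{p(s \mid y, \Theta)}{\sum_{s' \in \mathcal{K}} p(s' \mid y, \Theta)}
    & \text{if } s \in \mathcal{K}, \\[2ex]
  0 & \text{otherwise.}
\end{cases}

% Because the posterior normalizer cancels in the ratio, q can be evaluated
% from joint probabilities over K alone:
q(s) = \frac{p(s, y \mid \Theta)}{\sum_{s' \in \mathcal{K}} p(s', y \mid \Theta)}
\quad \text{for } s \in \mathcal{K}.
```

When \mathcal{K} contains a single state, q places all its mass on that state, which corresponds to the 'hard EM' special case the abstract mentions.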


Neural Networks Regularization Through Class-wise Invariant Representation Learning

arXiv.org Machine Learning

Training deep neural networks is known to require a large number of training samples. However, in many applications only a few training samples are available. In this work, we tackle the issue of training neural networks for classification tasks when few training samples are available. We attempt to solve this issue by proposing a new regularization term that constrains the hidden layers of a network to learn class-wise invariant representations. In our regularization framework, learning invariant representations is generalized to class membership, where samples of the same class should have the same representation. Numerical experiments on MNIST and its variants showed that our proposal helps improve the generalization of neural networks, particularly when trained with few samples.


Boosted Generative Models

arXiv.org Artificial Intelligence

We propose a novel approach for using unsupervised boosting to create an ensemble of generative models, where models are trained in sequence to correct earlier mistakes. Our meta-algorithmic framework can leverage any existing base learner that permits likelihood evaluation, including recent deep expressive models. Further, our approach allows the ensemble to include discriminative models trained to distinguish real data from model-generated data. We show theoretical conditions under which incorporating a new model in the ensemble will improve the fit and empirically demonstrate the effectiveness of our black-box boosting algorithms on density estimation, classification, and sample generation on benchmark datasets for a wide range of generative models.