Bayesian Inference
Scalable Importance Tempering and Bayesian Variable Selection
Zanella, Giacomo, Roberts, Gareth
We propose a Monte Carlo algorithm to sample from high-dimensional probability distributions that combines Markov chain Monte Carlo (MCMC) and importance sampling. We provide a careful theoretical analysis, including guarantees on robustness to high-dimensionality, explicit comparison with standard MCMC and illustrations of the potential improvements in efficiency. Simple and concrete intuition is provided for when the novel scheme is expected to outperform standard schemes. When applied to Bayesian Variable Selection problems, the novel algorithm is orders of magnitude more efficient than available alternative sampling schemes and allows to perform fast and reliable fully Bayesian inferences with tens of thousands regressors.
Semantic Channel and Shannon's Channel Mutually Match for Multi-Label Classification
A group of transition probability functions form a Shannon's channel whereas a group of truth functions form a semantic channel. Label learning is to let semantic channels match Shannon's channels and label selection is to let Shannon's channels match semantic channels. The Channel Matching (CM) algorithm is provided for multi-label classification. This algorithm adheres to maximum semantic information criterion which is compatible with maximum likelihood criterion and regularized least squares criterion. If samples are very large, we can directly convert Shannon's channels into semantic channels by the third kind of Bayes' theorem; otherwise, we can train truth functions with parameters by sampling distributions. A label may be a Boolean function of some atomic labels. For simplifying learning, we may only obtain the truth functions of some atomic label. For a given label, instances are divided into three kinds (positive, negative, and unclear) instead of two kinds as in popular studies so that the problem with binary relevance is avoided. For each instance, the classifier selects a compound label with most semantic information or richest connotation. As a predictive model, the semantic channel does not change with the prior probability distribution (source) of instances. It still works when the source is changed. The classifier changes with the source, and hence can overcome class-imbalance problem. It is shown that the old population's increasing will change the classifier for label "Old" and has been impelling the semantic evolution of "Old". The CM iteration algorithm for unseen instance classification is introduced.
Road Map for Choosing Between Statistical Modeling and Machine Learning Statistical Thinking
Statistical models (SMs) include ordinary regression, Bayesian regression, semiparametric models, generalized additive models, longitudinal models, time-to-event models, penalized regression, and others. Penalized regression includes ridge regression, lasso, and elastic net. Contrary to what some machine learning (ML) researchers believe, SMs easily allow for complexity (nonlinearity and second-order interactions) and an unlimited number of candidate features (if penalized maximum likelihood estimation or Bayesian models are used). It is especially easy, using regression splines, to allow every continuous predictor to have a smooth nonlinear effect. ML is taken to mean an algorithmic approach that does not use traditional identified statistical parameters, and for which a preconceived structure is not imposed on the relationships between predictors and outcomes. ML usually does not attempt to isolate the effect of any single variable.
A Non-parametric Multi-stage Learning Framework for Cognitive Spectrum Access in IoT Networks
Tholeti, Thulasi, Raj, Vishnu, Kalyani, Sheetal
Given the increasing number of devices that is going to get connected to wireless networks with the advent of Internet of Things, spectrum scarcity will present a major challenge. Application of opportunistic spectrum access mechanisms to IoT networks will become increasingly important to solve this. In this paper, we present a cognitive radio network architecture which uses multi-stage online learning techniques for spectrum assignment to devices, with the aim of improving the throughput and energy efficiency of the IoT devices. In the first stage, we use an AI technique to learn the quality of a user-channel pairing. The next stage utilizes a non-parametric Bayesian learning algorithm to estimate the Primary User OFF time in each channel. The third stage augments the Bayesian learner with implicit exploration to accelerate the learning procedure. The proposed method leads to significant improvement in throughput and energy efficiency of the IoT devices while keeping the interference to the primary users minimal. We provide comprehensive empirical validation of the method with other learning based approaches.
Efficiently Learning Nonstationary Gaussian Processes for Real World Impact
Most real world phenomena such as sunlight distribution under a forest canopy, minerals concentration, stock valuation, exhibit nonstationary dynamics i.e. phenomenon variation changes depending on the locality. Nonstationary dynamics pose both theoretical and practical challenges to statistical machine learning algorithms that aim to accurately capture the complexities governing the evolution of such processes. Typically the nonstationary dynamics are modeled using nonstationary Gaussian Process models (NGPS) that employ local latent dynamics parameterization to correspondingly model the nonstationary real observable dynamics. Recently, an approach based on most likely induced latent dynamics representation attracted research community's attention for a while. The approach could not be employed for large scale real world applications because learning a most likely latent dynamics representation involves maximization of marginal likelihood of the observed real dynamics that becomes intractable as the number of induced latent points grows with problem size. We have established a direct relationship between informativeness of the induced latent dynamics and the marginal likelihood of the observed real dynamics. This opens up the possibility of maximizing marginal likelihood of observed real dynamics indirectly by near optimally maximizing entropy or mutual information gain on the induced latent dynamics using greedy algorithms. Therefore, for an efficient yet accurate inference, we propose to build an induced latent dynamics representation using a novel algorithm LISAL that adaptively maximizes entropy or mutual information on the induced latent dynamics and marginal likelihood of observed real dynamics in an iterative manner. The relevance of LISAL is validated using real world datasets.
From Credit Assignment to Entropy Regularization: Two New Algorithms for Neural Sequence Prediction
Dai, Zihang, Xie, Qizhe, Hovy, Eduard
In this work, we study the credit assignment problem in reward augmented maximum likelihood (RAML) learning, and establish a theoretical equivalence between the token-level counterpart of RAML and the entropy regularized reinforcement learning. Inspired by the connection, we propose two sequence prediction algorithms, one extending RAML with fine-grained credit assignment and the other improving Actor-Critic with a systematic entropy regularization. On two benchmark datasets, we show the proposed algorithms outperform RAML and Actor-Critic respectively, providing new alternatives to sequence prediction.
The Logical Essentials of Bayesian Reasoning
This chapter offers an accessible introduction to the channel-based approach to Bayesian probability theory. This framework rests on algebraic and logical foundations, inspired by the methodologies of programming language semantics. It offers a uniform, structured and expressive language for describing Bayesian phenomena in terms of familiar programming concepts, like channel, predicate transformation and state transformation. The introduction also covers inference in Bayesian networks, which will be modelled by a suitable calculus of string diagrams.
Negative Log Likelihood Ratio Loss for Deep Neural Network Classification
Zhu, Donglai, Yao, Hengshuai, Jiang, Bei, Yu, Peng
Deep neural network (DNN) has achieved remarkable success in classification tasks such as image classification [1]. The network output can mimic the posterior probabilities of target classes for the input observation when the nonlinear activation function in the output layer is defined as a soft-max function [2]. The learning objective is to minimize the difference between the predicted distribution and the true datagenerating distribution. In information theory, the cross entropy between two probability distributions over a common event set of events measures the average number of bits needed to identify an event if coding follows a learned probability distribution rather than the true but unknow distribution [3]. Therefore, cross entropy is a reasonable loss function for the DNN-based classification. However, in practice the true data-generating probability distribution is unknown and replaced by the empirical probability distribution over a training set where each sample is drawn independently and identically distributed (i.i.d.) from the data space [4]. Under assumptions of uniform distributions of feature and label spaces, minimizing cross-entropy is equivalent to maximum likelihood, i.e., the learning problem aims to maximize likelihood of correct class for each of training samples [2]. Maximum likelihood is a generative training criterion by which the model learns the likelihood of correct class for the observation. The model makes predictions by using Bayes rules to calculate posterior probabilities of target classes for the observation and then select the most likely class.
High-dimensional Penalty Selection via Minimum Description Length Principle
Miyaguchi, Kohei, Yamanishi, Kenji
We tackle the problem of penalty selection of regularization on the basis of the minimum description length (MDL) principle. In particular, we consider that the design space of the penalty function is high-dimensional. In this situation, the luckiness-normalized-maximum-likelihood(LNML)-minimization approach is favorable, because LNML quantifies the goodness of regularized models with any forms of penalty functions in view of the minimum description length principle, and guides us to a good penalty function through the high-dimensional space. However, the minimization of LNML entails two major challenges: 1) the computation of the normalizing factor of LNML and 2) its minimization in high-dimensional spaces. In this paper, we present a novel regularization selection method (MDL-RS), in which a tight upper bound of LNML (uLNML) is minimized with local convergence guarantee. Our main contribution is the derivation of uLNML, which is a uniform-gap upper bound of LNML in an analytic expression. This solves the above challenges in an approximate manner because it allows us to accurately approximate LNML and then efficiently minimize it. The experimental results show that MDL-RS improves the generalization performance of regularized estimates specifically when the model has redundant parameters.