Goto

Collaborating Authors

 Bayesian Learning


Generalizing Information to the Evolution of Rational Belief

arXiv.org Machine Learning

Information theory provides a mathematical foundation to measure uncertainty in belief. Belief is represented by a probability distribution that captures our understanding of an outcome's plausibility. Information measures based on Shannon's concept of entropy include realization information, Kullback-Leibler divergence, Lindley's information in experiment, cross entropy, and mutual information. We derive a general theory of information from first principles that accounts for evolving belief and recovers all of these measures. Rather than simply gauging uncertainty, information is understood in this theory to measure change in belief. We may then regard entropy as the information we expect to gain upon realization of a discrete latent random variable. This theory of information is compatible with the Bayesian paradigm in which rational belief is updated as evidence becomes available. Furthermore, this theory admits novel measures of information with well-defined properties, which we explore in both analysis and experiment. This view of information illuminates the study of machine learning by allowing us to quantify information captured by a predictive model and distinguish it from residual information contained in training data. We gain related insights regarding feature selection, anomaly detection, and novel Bayesian approaches.


Convolutional Mixture Density Recurrent Neural Network for Predicting User Location with WiFi Fingerprints

arXiv.org Machine Learning

ABSTRACT Predicting smartphone users activity using WiFi fingerprints has been a popular approach for indoor positioning in recent years. However, such a high dimensional time-series prediction problem can be very tricky to solve. To address this issue, we propose a novel deep learning model, the convolutional mixture density recurrent neural network (CMDRNN), which combines the strengths of convolutional neural networks, recurrent neural networks and mixture density networks. In our model, the CNN sub-model is employed to detect the feature of the high dimensional input, the RNN sub-model is utilized to capture the time dependency and the MDN sub-model is for predicting the final output. For validation, we conduct the experiments on the real-world dataset and the obtained results illustrate the effectiveness of our method.


Gradient-based Optimization for Bayesian Preference Elicitation

arXiv.org Artificial Intelligence

Effective techniques for eliciting user preferences have taken on added importance as recommender systems (RSs) become increasingly interactive and conversational. A common and conceptually appealing Bayesian criterion for selecting queries is expected value of information (EVOI) . Unfortunately, it is computationally prohibitive to construct queries with maximum EVOI in RSs with large item spaces. We tackle this issue by introducing a continuous formulation of EVOI as a differentiable network that can be optimized using gradient methods available in modern machine learning (ML) computational frameworks (e.g., TensorFlow, PyTorch). We exploit this to develop a novel, scalable Monte Carlo method for EVOI optimization, which is more scalable for large item spaces than methods requiring explicit enumeration of items. While we emphasize the use of this approach for pairwise (or k -wise) comparisons of items, we also demonstrate how our method can be adapted to queries involving subsets of item attributes or "partial items," which are often more cognitively manageable for users. Experiments show that our gradient-based EVOI technique achieves state-of-the-art performance across several domains while scaling to large item spaces.


On Universal Features for High-Dimensional Learning and Inference

arXiv.org Machine Learning

We consider the problem of identifying universal low-dimensional features from high-dimensional data for inference tasks in settings involving learning. For such problems, we introduce natural notions of universality and we show a local equivalence among them. Our analysis is naturally expressed via information geometry, and represents a conceptually and computationally useful analysis. The development reveals the complementary roles of the singular value decomposition, Hirschfeld-Gebelein-R\'enyi maximal correlation, the canonical correlation and principle component analyses of Hotelling and Pearson, Tishby's information bottleneck, Wyner's common information, Ky Fan $k$-norms, and Brieman and Friedman's alternating conditional expectations algorithm. We further illustrate how this framework facilitates understanding and optimizing aspects of learning systems, including multinomial logistic (softmax) regression and the associated neural network architecture, matrix factorization methods for collaborative filtering and other applications, rank-constrained multivariate linear regression, and forms of semi-supervised learning.


Iterative Peptide Modeling With Active Learning And Meta-Learning

arXiv.org Machine Learning

Often the development of novel materials is not amenable to high-throughput or purely computational screening methods. Instead, materials must be synthesized one at a time in a process that does not generate significant amounts of data. One way this method can be improved is by ensuring that each experiment provides the best improvement in both material properties and predictive modeling accuracy. In this work, we study the effectiveness of active learning, which optimizes the order of experiments, and meta learning, which transfers knowledge from one context to another, to reduce the number of experiments necessary to build a predictive model. We present a novel multi-task benchmark database of peptides designed to advance active, few-shot, and meta-learning methods for experimental design. Each task is binary classification of peptides represented as a sequence string. We show results of standard active learning and meta-learning methods across these datasets to assess their ability to improve predictive models with the fewest number of experiments. We find the ensemble query by committee active learning method to be effective. The meta-learning method Reptile was found to improve accuracy. The robustness of these conclusions were tested across multiple model choices.


Additive Bayesian Network Modelling with the R Package abn

arXiv.org Machine Learning

It is a particularly well-suited approach to better understand the underlying structure of data when scientific understanding of the data is at an early stage. BN modelling is designed to sort out directly from indirectly related variables and offers a far richer modelling framework than classical approaches in epidemiology like, e.g., regression techniques or extensions thereof. In contrast to structural equation modelling (Hair, Black, Babin, Anderson, Tatham et al. 1998), which requires expert knowledge to design the model, the Additive Bayesian Network (ABN) method is a data-driven approach (Lewis and Ward 2013; Kratzer, Pittavino, Lewis, and Furrer 2019b). It does not rely on expert knowledge, but it can possiarXiv:1911.09006v1


Consistent Robust Adversarial Prediction for General Multiclass Classification

arXiv.org Machine Learning

Some example of the task are the zero-one loss classification where the predictor suffers a loss of one when making incorrect prediction and zero otherwise as well as the ordinal classification (also known as ordinal regression) where the predictor suffers a loss that increases as the prediction moves away from the true label. Empirical risk minimization (ERM) (Vapnik, 1992) is a standard approach for solving general multiclass classification problems by finding the classifier that minimizes a loss metric over the training data. However, since directly minimizing this loss over training data within the ERM framework is generally NPhard (Steinwart and Christmann, 2008), convex surrogate losses that can be efficiently optimized are employed to approximate the loss. Constructing surrogate losses for binary classification has been well studied, resulting in surrogate losses that enjoy desirable theoretical properties and good performance in practice. Among the popular examples are the logarithmic loss, which is minimized by the logistic regression classifier (McCullagh and Nelder, 1989), and the hinge loss, which is minimized by the support vector machine (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995).


LionForests: Local Interpretation of Random Forests through Path Selection

arXiv.org Artificial Intelligence

Towards a future where machine learning systems will integrate into every aspect of people's lives, researching methods to interpret such systems is necessary, instead of focusing exclusively on enhancing their performance. Enriching the trust between these systems and people will accelerate this integration process. Many medical and retail banking/finance applications use state-of-the-art machine learning techniques to predict certain aspects of new instances. Tree ensembles, like random forests, are widely acceptable solutions on these tasks, while at the same time they are avoided due to their black-box uninterpretable nature, creating an unreasonable paradox. In this paper, we provide a sequence of actions for shedding light on the predictions of the misjudged family of tree ensemble algorithms. Using classic unsupervised learning techniques and an enhanced similarity metric, to wander among transparent trees inside a forest following breadcrumbs, the interpretable essence of tree ensembles arises. An explanation provided by these systems using our approach, which we call "LionForests", can be a simple, comprehensive rule.


Bayesian sparse convex clustering via global-local shrinkage priors

arXiv.org Machine Learning

Sparse convex clustering is to cluster observations and conduct variable selection simultaneously in the framework of convex clustering. Although the weighted $L_1$ norm as the regularization term is usually employed in the sparse convex clustering, this increases the dependence on the data and reduces the estimation accuracy if the sample size is not sufficient. To tackle these problems, this paper proposes a Bayesian sparse convex clustering via the idea of Bayesian lasso and global-local shrinkage priors. We introduce Gibbs sampling algorithms for our method using scale mixtures of normals. The effectiveness of the proposed methods is shown in simulation studies and a real data analysis.


Predictive properties of forecast combination, ensemble methods, and Bayesian predictive synthesis

arXiv.org Machine Learning

This paper studies the theoretical predictive properties of classes of forecast combination methods. The study is motivated by the recently developed Bayesian framework for synthesizing predictive densities: Bayesian predictive synthesis. A novel strategy based on continuous time stochastic processes is proposed and developed, where the combined predictive error processes are expressed as stochastic differential equations, evaluated using Ito's lemma. We show that a subclass of synthesis functions under Bayesian predictive synthesis, which we categorize as non-linear synthesis, entails an extra term that "corrects" the bias from misspecification and dependence in the predictive error process, effectively improving forecasts. Theoretical properties are examined and shown that this subclass improves the expected squared forecast error over any and all linear combination, averaging, and ensemble of forecasts, under mild conditions. We discuss the conditions for which this subclass outperforms others, and its implications for developing forecast combination methods. A finite sample simulation study is presented to illustrate our results.