Bayesian Learning
On the Difficulty of Selecting Ising Models with Approximate Recovery
Scarlett, Jonathan, Cevher, Volkan
In this paper, we consider the problem of estimating the underlying graph associated with an Ising model given a number of independent and identically distributed samples. We adopt an \emph{approximate recovery} criterion that allows for a number of missed edges or incorrectly-included edges, in contrast with the widely-studied exact recovery problem. Our main results provide information-theoretic lower bounds on the sample complexity for graph classes imposing constraints on the number of edges, maximal degree, and other properties. We identify a broad range of scenarios where, either up to constant factors or logarithmic factors, our lower bounds match the best known lower bounds for the exact recovery criterion, several of which are known to be tight or near-tight. Hence, in these cases, approximate recovery has a similar difficulty to exact recovery in the minimax sense. Our bounds are obtained via a modification of Fano's inequality for handling the approximate recovery criterion, along with suitably-designed ensembles of graphs that can broadly be classed into two categories: (i) Those containing graphs that contain several isolated edges or cliques and are thus difficult to distinguish from the empty graph; (ii) Those containing graphs for which certain groups of nodes are highly correlated, thus making it difficult to determine precisely which edges connect them. We support our theoretical results on these ensembles with numerical experiments.
Identifiability Assumptions and Algorithm for Directed Graphical Models with Feedback
Park, Gunwoong, Raskutti, Garvesh
Directed graphical models provide a useful framework for modeling causal or directional relationships for multivariate data. Prior work has largely focused on identifiability and search algorithms for directed acyclic graphical (DAG) models. In many applications, feedback naturally arises and directed graphical models that permit cycles occur. In this paper we address the issue of identifiability for general directed cyclic graphical (DCG) models satisfying the Markov assumption. In particular, in addition to the faithfulness assumption which has already been introduced for cyclic models, we introduce two new identifiability assumptions, one based on selecting the model with the fewest edges and the other based on selecting the DCG model that entails the maximum number of d-separation rules. We provide theoretical results comparing these assumptions which show that: (1) selecting models with the largest number of d-separation rules is strictly weaker than the faithfulness assumption; (2) unlike for DAG models, selecting models with the fewest edges does not necessarily result in a milder assumption than the faithfulness assumption. We also provide connections between our two new principles and minimality assumptions. We use our identifiability assumptions to develop search algorithms for small-scale DCG models. Our simulation study supports our theoretical results, showing that the algorithms based on our two new principles generally out-perform algorithms based on the faithfulness assumption in terms of selecting the true skeleton for DCG models.
An optimal learning method for developing personalized treatment regimes
A treatment regime is a function that maps individual patient information to a recommended treatment, hence explicitly incorporating the heterogeneity in need for treatment across individuals. Patient responses are dichotomous and can be predicted through an unknown relationship that depends on the patient information and the selected treatment. The goal is to find the treatments that lead to the best patient responses on average. Each experiment is expensive, forcing us to learn the most from each experiment. We adopt a Bayesian approach both to incorporate possible prior information and to update our treatment regime continuously as information accrues, with the potential to allow smaller yet more informative trials and for patients to receive better treatment. By formulating the problem as contextual bandits, we introduce a knowledge gradient policy to guide the treatment assignment by maximizing the expected value of information, for which an approximation method is used to overcome computational challenges. We provide a detailed study on how to make sequential medical decisions under uncertainty to reduce health care costs on a real world knee replacement dataset. We use clustering and LASSO to deal with the intrinsic sparsity in health datasets. We show experimentally that even though the problem is sparse, through careful selection of physicians (versus picking them at random), we can significantly improve the success rates.
Bayesian machine learning - FastML
So you know the Bayes rule. How does it relate to machine learning? It can be quite difficult to grasp how the puzzle pieces fit together - we know it took us a while. This article is an introduction we wish we had back then. While we have some grasp on the matter, we're not experts, so the following might contain inaccuracies or even outright errors. Feel free to point them out, either in the comments or privately.
A Single-Pass Classifier for Categorical Data
This paper describes a new method for classifying a dataset that partitions elements into their categories. It has relations with neural networks but a slightly different structure, requiring only a single pass through the classifier to generate the weight sets. A grid-like structure is required as part of a novel idea of converting a 1-D row of real values into a 2-D structure of value bands. Each cell in any band then stores a distinct set of weights, to represent its own importance and its relation to each output category. During classification, all of the output weight lists can be retrieved and summed to produce a probability for what the correct output category is. The bands possibly work like hidden layers of neurons, but they are variable specific, making the process orthogonal. The construction process can be a single update process without iterations, making it potentially much faster. It can also be compared with k-NN and may be practical for partial or competitive updating.
Automatic Variational ABC
Moreno, Alexander, Adel, Tameem, Meeds, Edward, Rehg, James M., Welling, Max
Approximate Bayesian Computation (ABC) is a framework for performing likelihood-free posterior inference for simulation models. Stochastic Variational inference (SVI) is an appealing alternative to the inefficient sampling approaches commonly used in ABC. However, SVI is highly sensitive to the variance of the gradient estimators, and this problem is exacerbated by approximating the likelihood. We draw upon recent advances in variance reduction for SVI [6][13] and likelihood-free inference using deterministic simulations [12] to produce low variance gradient estimators of the variational lower-bound. By then exploiting automatic differentiation libraries [8] we can avoid nearly all model-specific derivations. We demonstrate performance on three problems and compare to existing SVI algorithms. Our results demonstrate the correctness and efficiency of our algorithm.
Expectation propagation for continuous time stochastic processes
Cseke, Botond, Schnoerr, David, Opper, Manfred, Sanguinetti, Guido
Physical and technological processes frequently exhibit intrinsic stochasticity. The main mathematical framework to describe and reason about such systems is provided by the theory of continuous time (Markovian) stochastic processes. Such processes have been well studied in chemical physics for several decades as models of chemical reactions at very low concentrations [Gardiner, 1985, e.g.]. More recently, the theory has found novel and diverse areas of application including systems biology at the single cell level [Wilkinson, 2011], ecology [Volkov et al., 2007] and performance modelling in computer systems [Hillston, 2005], to name but a few. The popularity of the approach has been greatly enhanced by the availability of efficient and accurate simulation algorithms [Gillespie, 1977, Gillespie et al., 2013], which permit a numerical solution of medium-sized systems within a reasonable time frame. As with most of science, many of the application domains of continuous time stochastic processes are becoming increasingly data-rich, creating a critical demand for inference algorithms which can use data to calibrate the models and analyse the uncertainty in the predictions. This raises new challenges and opportunities for statistics and machine learning, and has motivated the development of several algorithms for efficient inference in these systems. In this paper, we focus on the Bayesian approach, and formulate the inverse problem in terms of obtaining an approximation to a posterior distribution over the stochastic process, given observations of the system and using existing scientific information to build a prior model of the process.
History of Data Mining
Data mining is everywhere, but its story starts many years before Moneyball and Edward Snowden. The following are major milestones and "firsts" in the history of data mining plus how it's evolved and blended with data science and big data. Data mining is the computational process of exploring and uncovering patterns in large data sets a.k.a. It is fundamental to data mining and probability, since it allows understanding of complex realities based on estimated probabilities. The goal of regression analysis is to estimate the relationships among variables, and the specific method they used in this case is the method of least squares.
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems): Ian H. Witten, Eibe Frank: 9780120884070: Amazon.com: Books
This book is very easy to read and understand. Unlike Hastie's Statistical Learning book, it is not geared towards those with an expert level knowledge of statistics, and instead takes time to explain functions and formulas for the person with a decent but not extrordinary understanding of statistical/math concepts. For example, their description of a Gaussian was the clearest I've seen. On the other hand, if you're math/statistics background is considerable, you may find this book somewhat simplistic or tedious. The book has a good coverage of techniques and algorithms, although I was somewhat disappointed that they do not mention Influence Diagrams, considering the amount of coverage of both decision trees and Bayesian techniques.