Uncertainty
A tree augmented naive Bayesian network experiment for breast cancer prediction
In order to investigate the breast cancer prediction problem on the aging population with the grades of DCIS, we conduct a tree augmented naive Bayesian network experiment trained and tested on a large clinical dataset including consecutive diagnostic mammography examinations, consequent biopsy outcomes and related cancer registry records in the population of women across all ages. Our tasks are to classify the conventional "Benign vs. Malignant" and the new "Benign/LG vs. IntG/HG/Invasive" based on mammography examination features and patient demographic information, specifically to predict the probability of malignancy, for the biopsy threshold setting and the biopsy decision making. The aggregated results of our tenfold cross validation method recommend a biopsy threshold higher than 2% for the aging population. The Receiver Operating Characteristic curves and the Precision-Recall curves by aggregating the tenfold cross validation results are interesting.
A hybrid algorithm for Bayesian network structure learning with application to multi-label learning
Gasse, Maxime, Aussem, Alex, Elghazel, Haytham
We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusions that local structural learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available.
FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test
The maximum mean discrepancy (MMD) is a recently proposed test statistic for two-sample test. Its quadratic time complexity, however, greatly hampers its availability to large-scale applications. To accelerate the MMD calculation, in this study we propose an efficient method called FastMMD. The core idea of FastMMD is to equivalently transform the MMD with shift-invariant kernels into the amplitude expectation of a linear combination of sinusoid components based on Bochner's theorem and Fourier transform (Rahimi & Recht, 2007). Taking advantage of sampling of Fourier transform, FastMMD decreases the time complexity for MMD calculation from $O(N^2 d)$ to $O(L N d)$, where $N$ and $d$ are the size and dimension of the sample set, respectively. Here $L$ is the number of basis functions for approximating kernels which determines the approximation accuracy. For kernels that are spherically invariant, the computation can be further accelerated to $O(L N \log d)$ by using the Fastfood technique (Le et al., 2013). The uniform convergence of our method has also been theoretically proved in both unbiased and biased estimates. We have further provided a geometric explanation for our method, namely ensemble of circular discrepancy, which facilitates us to understand the insight of MMD, and is hopeful to help arouse more extensive metrics for assessing two-sample test. Experimental results substantiate that FastMMD is with similar accuracy as exact MMD, while with faster computation speed and lower variance than the existing MMD approximation methods.
Convex Risk Minimization and Conditional Probability Estimation
Telgarsky, Matus, Dudรญk, Miroslav, Schapire, Robert
This paper proves, in very general settings, that convex risk minimization is a procedure to select a unique conditional probability model determined by the classification problem. Unlike most previous work, we give results that are general enough to include cases in which no minimum exists, as occurs typically, for instance, with standard boosting algorithms. Concretely, we first show that any sequence of predictors minimizing convex risk over the source distribution will converge to this unique model when the class of predictors is linear (but potentially of infinite dimension). Secondly, we show the same result holds for \emph{empirical} risk minimization whenever this class of predictors is finite dimensional, where the essential technical contribution is a norm-free generalization bound.
On the properties of variational approximations of Gibbs posteriors
Alquier, Pierre, Ridgway, James, Chopin, Nicolas
The PAC-Bayesian approach is a powerful set of techniques to derive non- asymptotic risk bounds for random estimators. The corresponding optimal distribution of estimators, usually called the Gibbs posterior, is unfortunately intractable. One may sample from it using Markov chain Monte Carlo, but this is often too slow for big datasets. We consider instead variational approximations of the Gibbs posterior, which are fast to compute. We undertake a general study of the properties of such approximations. Our main finding is that such a variational approximation has often the same rate of convergence as the original PAC-Bayesian procedure it approximates. We specialise our results to several learning tasks (classification, ranking, matrix completion),discuss how to implement a variational approximation in each case, and illustrate the good properties of said approximation on real datasets.
On the Generalization of the C-Bound to Structured Output Ensemble Methods
Laviolette, Franรงois, Morvant, Emilie, Ralaivola, Liva, Roy, Jean-Francis
It is well-known that learning predictive models capable of dealing with outputs that are richer than binary outputs (e.g., multiclass or multilabel) and for which theoretical guarantees exist is still a realm of intensive investigations. From a practical standpoint, a lot of relaxations for learning with complex outputs have been devised. A common approach consists in decomposing the output space into "simpler" spaces so that the learning problem at hand can be reduced to a few easier (i.e., binary) learning tasks. For instance, this is the idea spurred by the Error-Correcting Output Codes (Dietterich & Bakiri, 1995) that makes possible to reduce multiclass or multilabel problems into binary classification tasks,e.g., (Allwein et al., 2001; Mroueh et al., 2012; Read et al., 2011; Tsoumakas & Vlahavas, 2007; Zhang & Schneider, 2012). In our work, we study the problem of complex output prediction by focusing on prediction functions that take the form of a weighted majority vote over a set of complex output classifiers (or voters). Recall that ensemble methods can all be seen as majority vote learning procedures (Dietterich, 2000; Re & Valentini, 2012). Methods such as Bagging (Breiman, 1996), Boosting (Schapire & Singer, 1999) and Random Forests (Breiman, 2001) are representative voting methods. Cortes et al. (2014) have proposed various ensemble methods for the structured output prediction framework. Note also that majority votes are also central to the Bayesian approach (Gelman et al., 2004) with the notion of Bayesian model averaging (Domingos, 2000; Haussler et al., 1994) and most of kernel-based predictors, such as the Support Vector Machines (Boser et al., 1992; Cortes & Vapnik, 1995) may be viewed as weighted majority votes as well: for binary classification, where the predicted class for some input x is computed as the sign of
Consistency and fluctuations for stochastic gradient Langevin dynamics
Teh, Yee Whye, Thiรฉry, Alexandre, Vollmer, Sebastian
Applying standard Markov chain Monte Carlo (MCMC) algorithms to large data sets is computationally expensive. Both the calculation of the acceptance probability and the creation of informed proposals usually require an iteration through the whole data set. The recently proposed stochastic gradient Langevin dynamics (SGLD) method circumvents this problem by generating proposals which are only based on a subset of the data, by skipping the accept-reject step and by using decreasing step-sizes sequence $(\delta_m)_{m \geq 0}$. %Under appropriate Lyapunov conditions, We provide in this article a rigorous mathematical framework for analysing this algorithm. We prove that, under verifiable assumptions, the algorithm is consistent, satisfies a central limit theorem (CLT) and its asymptotic bias-variance decomposition can be characterized by an explicit functional of the step-sizes sequence $(\delta_m)_{m \geq 0}$. We leverage this analysis to give practical recommendations for the notoriously difficult tuning of this algorithm: it is asymptotically optimal to use a step-size sequence of the type $\delta_m \asymp m^{-1/3}$, leading to an algorithm whose mean squared error (MSE) decreases at rate $\mathcal{O}(m^{-1/3})$
MCMC for Variationally Sparse Gaussian Processes
Hensman, James, Matthews, Alexander G. de G., Filippone, Maurizio, Ghahramani, Zoubin
Gaussian process (GP) models form a core part of probabilistic machine learning. Considerable research effort has been made into attacking three issues with GP models: how to compute efficiently when the number of data is large; how to approximate the posterior when the likelihood is not Gaussian and how to estimate covariance function parameter posteriors. This paper simultaneously addresses these, using a variational approximation to the posterior which is sparse in support of the function but otherwise free-form. The result is a Hybrid Monte-Carlo sampling scheme which allows for a non-Gaussian approximation over the function values and covariance parameters simultaneously, with efficient computations based on inducing-point sparse GPs. Code to replicate each experiment in this paper will be available shortly.
Automatic Variational Inference in Stan
Kucukelbir, Alp, Ranganath, Rajesh, Gelman, Andrew, Blei, David M.
Variational inference is a scalable technique for approximate Bayesian inference. Deriving variational inference algorithms requires tedious model-specific calculations; this makes it difficult to automate. We propose an automatic variational inference algorithm, automatic differentiation variational inference (ADVI). The user only provides a Bayesian model and a dataset; nothing else. We make no conjugacy assumptions and support a broad class of models. The algorithm automatically determines an appropriate variational family and optimizes the variational objective. We implement ADVI in Stan (code available now), a probabilistic programming framework. We compare ADVI to MCMC sampling across hierarchical generalized linear models, nonconjugate matrix factorization, and a mixture model. We train the mixture model on a quarter million images. With ADVI we can use variational inference on any model we write in Stan.
Probabilistic Curve Learning: Coulomb Repulsion and the Electrostatic Gaussian Process
Learning of low dimensional structure in multidimensional data is a canonical problem in machine learning. One common approach is to suppose that the observed data are close to a lower-dimensional smooth manifold. There are a rich variety of manifold learning methods available, which allow mapping of data points to the manifold. However, there is a clear lack of probabilistic methods that allow learning of the manifold along with the generative distribution of the observed data. The best attempt is the Gaussian process latent variable model (GP-LVM), but identifiability issues lead to poor performance. We solve these issues by proposing a novel Coulomb repulsive process (Corp) for locations of points on the manifold, inspired by physical models of electrostatic interactions among particles. Combining this process with a GP prior for the mapping function yields a novel electrostatic GP (electroGP) process. Focusing on the simple case of a one-dimensional manifold, we develop efficient inference algorithms, and illustrate substantially improved performance in a variety of experiments including filling in missing frames in video.