Directed Networks
A multiple testing framework for diagnostic accuracy studies with co-primary endpoints
Westphal, Max, Zapf, Antonia, Brannath, Werner
This is indicated, among other things, by several review and overview publications (Ching et al., 2018; Jiang et al., 2017; Litjens et al., 2017; Miotto, Wang, Wang, Jiang, & Dudley, 2017). In particular, the capabilities of end-to-end deep learning approaches on such supervised learning tasks are highly promising. For instance, vast advances have been reported in the literature regarding cancer diagnosis with deep neural networks (Hu et al., 2018). End-to-end deep learning refers to a trend involving deep (neural network) model architectures which are able to learn highly complex relationships between predictors and the target variable while having fewer parameters than traditional (shallower) models with comparable performance (Goodfellow, Bengio, & Courville, 2016). In the training process, highly complex features are derived automatically by the learning algorithm (LeCun, Bengio, & Hinton, 2015). This framework contrasts with the traditional pipeline of domain-specific data preprocessing and handcrafted features in combination with simpler prediction models. Despite all the recent success of machine learning, there are still challenges regarding over-optimistic conclusions drawn from finite datasets, which may to a large extent be attributed to the following two (broad) categories: 1. Study design and reporting: The most popular recommendation to split data for training, selection and evaluation is frequently employed in practice (Friedman, Hastie, & Tibshirani, 2009; Géron, 2017; Goodfellow et al., 2016; Japkowicz & Shah, 2011; Kuhn & Johnson, 2013; Zheng, 2015). In the ML community, the corresponding datasets are commonly denoted as training, validation and test set.
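As an illustration of the three-way split described above, the following is a minimal sketch in plain Python. The function name `three_way_split`, the default fractions (60/20/20) and the seed are assumptions chosen for this example; they are not taken from the cited works.

```python
import random


def three_way_split(items, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle `items` and split them into training, validation and test lists.

    The fractions are illustrative defaults (an assumption of this sketch);
    whatever remains after training and validation goes to the test set.
    """
    if not (0 < frac_train < 1 and 0 < frac_val < 1 and frac_train + frac_val < 1):
        raise ValueError("fractions must be in (0, 1) and sum to less than 1")

    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * frac_train)
    n_val = round(len(shuffled) * frac_val)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

In this scheme the training set is used to fit candidate models, the validation set to select among them, and the test set only once for the final evaluation, so that the reported performance is not biased by the selection step.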
Uncertainty relations and fluctuation theorems for Bayes nets
The pioneering paper [Ito and Sagawa, 2013] analyzed the non-equilibrium statistical physics of a set of multiple interacting systems, S, whose joint discrete-time evolution is specified by a Bayesian network. The major result of [Ito and Sagawa, 2013] was an integral fluctuation theorem (IFT) governing the sum of two quantities: the entropy production (EP) of an arbitrary single system v in S, and the transfer entropy from v to the other systems. Here I extend the analysis in [Ito and Sagawa, 2013]. I derive several detailed fluctuation theorems (DFTs), concerning arbitrary subsets of all the systems (including the full set). I also derive several associated IFTs, concerning an arbitrary subset of the systems, thereby extending the IFT in [Ito and Sagawa, 2013]. In addition, I derive "conditional" DFTs and IFTs, involving conditional probability distributions rather than (as in conventional fluctuation theorems) unconditioned distributions. I then derive thermodynamic uncertainty relations relating the total EP of the Bayes net to the set of all the precisions of probability currents within the individual systems. I end with an example of that uncertainty relation.
Probabilistic Similarity Networks
Normative expert systems have not become commonplace because they have been difficult to build and use. Over the past decade, however, researchers have developed the influence diagram, a graphical representation of a decision maker's beliefs, alternatives, and preferences that serves as the knowledge base of a normative expert system. Most people who have seen the representation find it intuitive and easy to use. Consequently, the influence diagram has significantly overcome the barriers to constructing normative expert systems. Nevertheless, building influence diagrams is not practical for extremely large and complex domains. In this book, I address the difficulties associated with the construction of the probabilistic portion of an influence diagram, called a knowledge map, belief network, or Bayesian network. I introduce two representations that facilitate the generation of large knowledge maps. In particular, I introduce the similarity network, a tool for building the network structure of a knowledge map, and the partition, a tool for assessing the probabilities associated with a knowledge map. I then use these representations to build Pathfinder, a large normative expert system for the diagnosis of lymph-node diseases (the domain contains over 60 diseases and over 100 disease findings). In an early version of the system, I encoded the knowledge of the expert using the erroneous assumption that all disease findings were independent, given each disease. When the expert and I attempted to build a more accurate knowledge map for the domain that would capture the dependencies among the disease findings, we failed. Using a similarity network, however, we built the knowledge-map structure for the entire domain in approximately 40 hours. Furthermore, the partition representation reduced the number of probability assessments required by the expert from 75,000 to 14,000.
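To make the conditional-independence assumption mentioned above concrete, here is a minimal sketch of how a posterior over diseases is computed when every finding is assumed independent given the disease. It is a generic illustration, not Pathfinder's implementation; the function name `naive_bayes_posterior` and the data layout are assumptions of this example.

```python
def naive_bayes_posterior(prior, likelihoods, findings):
    """Posterior over diseases, assuming findings are independent given the disease.

    prior:       {disease: P(disease)}
    likelihoods: {disease: {finding: P(finding present | disease)}}
    findings:    {finding: True if observed present, False if observed absent}
    All names and the data layout are hypothetical, chosen for illustration.
    """
    unnormalized = {}
    for disease, p_disease in prior.items():
        score = p_disease
        for finding, present in findings.items():
            p_present = likelihoods[disease][finding]
            # Under conditional independence the joint likelihood is a product
            # of one factor per finding.
            score *= p_present if present else 1.0 - p_present
        unnormalized[disease] = score

    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed findings have zero probability under every disease")
    return {disease: score / total for disease, score in unnormalized.items()}
```

The product over findings is what makes this assumption cheap to assess (one probability per finding and disease), and also what makes it inaccurate when findings are in fact dependent.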
Coverage-based Outlier Explanation
Wu, Yue, Akoglu, Leman, Davidson, Ian
Outlier detection is a core task in data mining with a plethora of algorithms that have enjoyed wide-scale usage. Existing algorithms are primarily focused on detection, that is, the identification of outliers in a given dataset. In this paper we explore the relatively under-studied outlier explanation problem. Our goal is, given a dataset that is already divided into outliers and normal instances, to explain what characterizes the outliers. We explore the novel direction of a semantic explanation that a domain expert or policy maker is able to understand. We formulate this as an optimization problem to find explanations that are both interpretable and pure. Through experiments on real-world data sets, we quantitatively show that our method can efficiently generate better explanations compared with rule-based learners.
Computational Separations between Sampling and Optimization
Two commonly arising computational tasks in Bayesian learning are Optimization (Maximum A Posteriori estimation) and Sampling (from the posterior distribution). In the convex case these two problems are efficiently reducible to each other. Recent work (Ma et al. 2019) shows that in the non-convex case, sampling can sometimes be provably faster. We present a simpler and stronger separation. We then compare sampling and optimization in more detail and show that they are provably incomparable: there are families of continuous functions for which optimization is easy but sampling is NP-hard, and vice versa. Further, we show function families that exhibit a sharp phase transition in the computational complexity of sampling, as one varies the natural temperature parameter. Our results draw on a connection to analogous separations in the discrete setting which are well-studied.
Guided Layer-wise Learning for Deep Models using Side Information
Sulimov, Pavel, Sukmanova, Elena, Chereshnev, Roman, Kertesz-Farkas, Attila
Training of deep models for classification tasks is hindered by local minima problems and vanishing gradients, while unsupervised layer-wise pretraining does not exploit information from class labels. Here, we propose a new regularization technique, called diversifying regularization (DR), which applies a penalty on hidden units at any layer if they obtain similar features for different types of data. For generative models, DR is defined as a divergence over the variational posterior distributions and included in the maximum likelihood estimation as a prior. Thus, DR includes class label information for greedy pretraining of deep belief networks, which results in a better weight initialization for fine-tuning methods. On the other hand, for discriminative training of deep neural networks, DR is defined as a distance over the features and included in the learning objective. With our experimental tests, we show that DR can help backpropagation cope with vanishing gradient problems and provide faster convergence and smaller generalization errors.
GP-ALPS: Automatic Latent Process Selection for Multi-Output Gaussian Process Models
Berkovich, Pavel, Perim, Eric, Bruinsma, Wessel
A principled approach to prediction tasks is to choose a statistical model that explains the data. The choice of the model class is crucial and has to observe the bias-variance tradeoff, which motivates the need for principled approaches to selecting the best model class from a set of options. Whilst model selection can be done manually by trial and error, the process tends to consume considerable time and resources and is prone to human biases. Bayesian model selection (MacKay, 1992; Rasmussen and Ghahramani, 2001) treats the model class as a random variable and computes its posterior distribution. It offers a built-in complexity regulariser, commonly known as the Bayesian Occam's razor, which penalises models whose complexity is excessive or too modest.
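The posterior over model classes mentioned above follows from Bayes' rule: it is proportional to the marginal likelihood (evidence) of each model class times its prior. The sketch below assumes the log evidences have already been computed by some other means; the function name `model_posterior` and the dictionary layout are assumptions of this example, not part of GP-ALPS.

```python
import math


def model_posterior(log_evidence, log_prior=None):
    """Posterior probability of each model class given its log marginal likelihood.

    log_evidence: {model_name: log p(data | model)}
    log_prior:    {model_name: log p(model)}; a uniform prior is assumed if omitted.
    """
    if not log_evidence:
        raise ValueError("at least one model class is required")
    if log_prior is None:
        # A uniform prior only adds a constant, which cancels on normalisation.
        log_prior = {name: 0.0 for name in log_evidence}

    log_joint = {name: log_evidence[name] + log_prior[name] for name in log_evidence}

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_joint.values())
    weights = {name: math.exp(value - peak) for name, value in log_joint.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

Because the evidence integrates over the parameters of each model class, an overly flexible class spreads its probability mass over many possible datasets and so receives a lower evidence for the observed one, which is the Occam's razor effect referred to in the text.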
Scalable Variational Gaussian Processes for Crowdsourcing: Glitch Detection in LIGO
Morales-Álvarez, Pablo, Ruiz, Pablo, Coughlin, Scott, Molina, Rafael, Katsaggelos, Aggelos K.
In recent years, crowdsourcing has been transforming the way classification training sets are obtained. Instead of relying on a single expert annotator, crowdsourcing shares the labelling effort among a large number of collaborators. For instance, this is being applied to the data acquired by the laureate Laser Interferometer Gravitational-Wave Observatory (LIGO), in order to detect glitches which might hinder the identification of true gravitational waves. The crowdsourcing scenario poses new challenges, as it deals with different opinions from a heterogeneous group of annotators with unknown degrees of expertise. Probabilistic methods, such as Gaussian Processes (GP), have proven successful in modeling this setting. However, GPs do not scale well to large data sets, which hampers their broad adoption in practice (in particular at LIGO). This has led to the recent introduction of deep-learning-based crowdsourcing methods, which have become the state of the art. However, the accurate uncertainty quantification of GPs has been partially sacrificed. This is an important aspect for astrophysicists in LIGO, since a glitch detection system should provide very accurate probability distributions of its predictions. In this work, we leverage the most popular sparse GP approximation to develop a novel GP-based crowdsourcing method that factorizes into mini-batches. This makes it able to cope with previously prohibitive data sets. The approach, which we refer to as Scalable Variational Gaussian Processes for Crowdsourcing (SVGPCR), brings GP-based methods back to the state of the art, and excels at uncertainty quantification. SVGPCR is shown to outperform deep-learning-based methods and previous probabilistic approaches when applied to the LIGO data. Moreover, its behavior and main properties are carefully analyzed in a controlled experiment based on the MNIST data set.
Training Neural Networks for Likelihood/Density Ratio Estimation
Moustakides, George V., Basioti, Kalliopi
Various problems in Engineering and Statistics require the computation of the likelihood ratio function of two probability densities. In classical approaches the two densities are assumed known or to belong to some known parametric family. In a data-driven version we replace this requirement with the availability of data sampled from the densities of interest. For most well-known problems in Detection and Hypothesis testing we develop solutions by providing neural network based estimates of the likelihood ratio or its transformations. This task necessitates the definition of proper optimizations which can be used for the training of the network. The main purpose of this work is to offer a simple and unified methodology for defining such optimization problems with guarantees that the solution is indeed the desired function. Our results are extended to cover estimates for likelihood ratios of conditional densities and estimates for statistics encountered in local approaches. The likelihood ratio of two probability densities is a function that appears in a variety of problems in Engineering and Statistics. Characteristic examples [1], [2] include Hypothesis testing, Signal detection, Sequential hypothesis testing, Sequential detection of changes, etc. Many of these problems also use the likelihood ratio under a transformed form, with the most frequent example being the log-likelihood ratio. In all these problems the main assumption is that the corresponding probability densities are available under some functional form. Our aim in this work is to replace this requirement with the availability of data sampled from each of the densities of interest. As we mentioned, the computation of the likelihood ratio function relies on the knowledge of the probability densities which, for the majority of applications, is an unrealistic assumption.
One can instead propose parametric families of densities and, with the help of available data, estimate the parameters and form the likelihood ratio function. However, with the advent of Data Science and Deep Learning there is a phenomenal increase in the need to process data coming from images, videos, etc. For most of these cases it is very difficult to propose any meaningful parametric family of densities that could reliably describe their statistical behavior. Therefore, these techniques tend to be unsuitable for most of these datasets. If parametric families cannot be employed, one can always resort to nonparametric density estimation [3] and then form the likelihood ratio. These approaches are purely data-driven but require two different approximations, namely one for each density.
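The nonparametric route described above can be sketched as follows: estimate each density separately from its own sample and divide the two estimates. This is an illustration of the baseline that needs two approximations, not of the neural network method proposed in the paper. The choice of a one-dimensional Gaussian kernel, the fixed bandwidth, and the names `gaussian_kde` and `likelihood_ratio` are assumptions of this example.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    if not samples:
        raise ValueError("at least one sample is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred on the observed samples.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def likelihood_ratio(samples_f, samples_g, bandwidth=0.5):
    """Estimate f(x) / g(x) from one sample per density (two separate estimates)."""
    f_hat = gaussian_kde(samples_f, bandwidth)
    g_hat = gaussian_kde(samples_g, bandwidth)

    def ratio(x):
        # May underflow to a zero denominator far from the samples of g.
        return f_hat(x) / g_hat(x)

    return ratio
```

Errors in the two density estimates compound in the quotient, particularly where the denominator estimate is close to zero, which is one practical drawback of forming the ratio from two separate approximations.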
Scalable Deep Generative Relational Models with High-Order Node Dependence
Fan, Xuhui, Li, Bin, Sisson, Scott Anthony, Li, Caoyuan, Chen, Ling
We propose a probabilistic framework for modelling and exploring the latent structure of relational data. Given feature information for the nodes in a network, the scalable deep generative relational model (SDREM) builds a deep network architecture that can approximate potential nonlinear mappings between nodes' feature information and the nodes' latent representations. Our contribution is two-fold: (1) We incorporate high-order neighbourhood structure information to generate the latent representations at each node, which vary smoothly over the network. (2) Due to the Dirichlet random variable structure of the latent representations, we introduce a novel data augmentation trick which permits efficient Gibbs sampling. The SDREM can be used for large sparse networks as its computational cost scales with the number of positive links. We demonstrate its competitive performance through improved link prediction on a range of real-world datasets.