Bayesian Inference
Learning Models with Uniform Performance via Distributionally Robust Optimization
Duchi, John, Namkoong, Hongseok
In many applications of statistics and machine learning, we wish to learn models that achieve uniformly good performance over almost all input values. This is important for safety-and fairnesscritical systems such as medical diagnosis, autonomous vehicles, criminal justice and credit evaluations, where poor performance on the tails of the inputs leads to high-cost system failures. Methods that optimize average performance, however, often produce models that suffer low performance on the "hard" instances of the population. For example, standard regressors obtained from maximum likelihood estimation can lose their predictive power on certain regions of covariates [57], so that high average performance comes at the expense of low performance on minority subpopulations. In this work, we propose and study a procedure that explicitly optimizes performance on tail inputs that suffer high loss. Modern datasets incorporate heterogeneous (but latent) subpopulations, and a natural goal is to perform well across all of these [57, 65, 21]. While many statistical models show strong average performance, their performance often deteriorates on minority groups underrepresented in the dataset. For example, speech recognition systems are inaccurate for people with minority accents [4]. In numerous other applications--such as facial recognition, automatic video captioning, language identification, academic recommender systems--performance varies significantly over different demographic groupings, such as race, gender, or age [38, 42, 18, 68, 76].
Renormalized Normalized Maximum Likelihood and Three-Part Code Criteria For Learning Gaussian Networks
Alipourfard, Borzou, Gao, Jean X.
Score based learning (SBL) is a promising approach for learning Bayesian networks in the discrete domain. However, when employing SBL in the continuous domain, one is either forced to move the problem to the discrete domain or use metrics such as BIC/AIC, and these approaches are often lacking. Discretization can have an undesired impact on the accuracy of the results, and BIC/AIC can fall short of achieving the desired accuracy. In this paper, we introduce two new scoring metrics for scoring Bayesian networks in the continuous domain: the three-part minimum description length and the renormalized normalized maximum likelihood metric. We rely on the minimum description length principle in formulating these metrics. The metrics proposed are free of hyperparameters, decomposable, and are asymptotically consistent. We evaluate our solution by studying the convergence rate of the learned graph to the generating network and, also, the structural hamming distance of the learned graph to the generating network. Our evaluations show that the proposed metrics outperform their competitors, the BIC/AIC metrics. Furthermore, using the proposed RNML metric, SBL will have the fastest rate of convergence with the smallest structural hamming distance to the generating network.
Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks
We propose Lomax delegate racing (LDR) to explicitly model the mechanism of survival under competing risks and to interpret how the covariates accelerate or decelerate the time to event. LDR explains non-monotonic covariate effects by racing a potentially infinite number of sub-risks, and consequently relaxes the ubiquitous proportional-hazards assumption which may be too restrictive. Moreover, LDR is naturally able to model not only censoring, but also missing event times or event types. For inference, we develop a Gibbs sampler under data augmentation for moderately sized data, along with a stochastic gradient descent maximum a posteriori inference algorithm for big data applications. Illustrative experiments are provided on both synthetic and real datasets, and comparison with various benchmark algorithms for survival analysis with competing risks demonstrates distinguished performance of LDR.
Infinite Factorial Finite State Machine for Blind Multiuser Channel Estimation
Ruiz, Francisco J. R., Valera, Isabel, Svensson, Lennart, Perez-Cruz, Fernando
New communication standards need to deal with machine-to-machine communications, in which users may start or stop transmitting at any time in an asynchronous manner. Thus, the number of users is an unknown and time-varying parameter that needs to be accurately estimated in order to properly recover the symbols transmitted by all users in the system. In this paper, we address the problem of joint channel parameter and data estimation in a multiuser communication channel in which the number of transmitters is not known. For that purpose, we develop the infinite factorial finite state machine model, a Bayesian nonparametric model based on the Markov Indian buffet that allows for an unbounded number of transmitters with arbitrary channel length. We propose an inference algorithm that makes use of slice sampling and particle Gibbs with ancestor sampling. Our approach is fully blind as it does not require a prior channel estimation step, prior knowledge of the number of transmitters, or any signaling information. Our experimental results, loosely based on the LTE random access channel, show that the proposed approach can effectively recover the data-generating process for a wide range of scenarios, with varying number of transmitters, number of receivers, constellation order, channel length, and signal-to-noise ratio.
Prediction of Atomization Energy Using Graph Kernel and Active Learning
Tang, Yu-Hang, de Jong, Wibe A.
Data-driven prediction of molecular properties presents unique challenges to the design of machine learning methods concerning data structure/dimensionality, symmetry adaption, and confidence management. In this paper, we present a kernel-based pipeline that can learn and predict the atomization energy of molecules with high accuracy. The framework employs Gaussian process regression to perform predictions based on the similarity between molecules, which is computed using the marginalized graph kernel. We discuss why the graph kernel, paired with a graph representation of the molecules, is particularly useful for predicting extensive properties. We demonstrate that using an active learning procedure, the proposed method can achieve a mean absolute error less than 1.0 kcal/mol on the QM7 data set using as few as 1200 training samples and 1 hour of training time. This is a demonstration, in contrast to common believes, that regression models based on kernel methods can be simultaneously accurate and fast predictors.
Visions of a generalized probability theory
In this Book we argue that the fruitful interaction of computer vision and belief calculus is capable of stimulating significant advances in both fields. From a methodological point of view, novel theoretical results concerning the geometric and algebraic properties of belief functions as mathematical objects are illustrated and discussed in Part II, with a focus on both a perspective 'geometric approach' to uncertainty and an algebraic solution to the issue of conflicting evidence. In Part III we show how these theoretical developments arise from important computer vision problems (such as articulated object tracking, data association and object pose estimation) to which, in turn, the evidential formalism is able to provide interesting new solutions. Finally, some initial steps towards a generalization of the notion of total probability to belief functions are taken, in the perspective of endowing the theory of evidence with a complete battery of estimation and inference tools to the benefit of all scientists and practitioners.
Approximate Dynamic Programming for Planning a Ride-Sharing System using Autonomous Fleets of Electric Vehicles
Al-Kanj, Lina, Nascimento, Juliana, Powell, Warren B.
Within a decade, almost every major auto company, along with fleet operators such as Uber, have announced plans to put autonomous vehicles on the road. At the same time, electric vehicles are quickly emerging as a next-generation technology that is cost effective, in addition to offering the benefits of reducing the carbon footprint. The combination of a centrally managed fleet of driverless vehicles, along with the operating characteristics of electric vehicles, is creating a transformative new technology that offers significant cost savings with high service levels. This problem involves a dispatch problem for assigning riders to cars, a planning problem for deciding on the fleet size, and a surge pricing problem for deciding on the price per trip. In this work, we propose to use approximate dynamic programming to develop high-quality operational dispatch strategies to determine which car (given the battery level) is best for a particular trip (considering its length and destination), when a car should be recharged, and when it should be re-positioned to a different zone which offers a higher density of trips. We then discuss surge pricing using an adaptive learning approach to decide on the price for each trip. Finally, we discuss the fleet size problem which depends on the previous two problems.
Generalized Earthquake Frequency-Magnitude Distribution Described by Asymmetric Laplace Mixture Modelling
The complete part of the earthquake frequency-magnitude distribution (FMD), above completeness magnitude mc, is well described by the Gutenberg-Richter law. The parameter mc however varies in space due to the seismic network configuration, yielding a convoluted FMD shape below max(mc). This paper investigates the shape of the generalized FMD (GFMD), which may be described as a mixture of elemental FMDs (eFMDs) defined as asymmetric Laplace distributions of mode mc [Mignan, 2012, https://doi.org/10.1029/2012JB009347]. An asymmetric Laplace mixture model (GFMD- ALMM) is thus proposed with its parameters (detection parameter kappa, Gutenberg-Richter beta-value, mc distribution, as well as number K and weight w of eFMD components) estimated using a semi-supervised hard expectation maximization approach including BIC penalties for model complexity. The performance of the proposed method is analysed, with encouraging results obtained: kappa, beta, and the mc distribution range are retrieved for different GFMD shapes in simulations, as well as in regional catalogues (southern and northern California, Nevada, Taiwan, France), in a global catalogue, and in an aftershock sequence (Christchurch, New Zealand). We find max(mc) to be conservative compared to other methods, kappa = k/log(10) = 3 in most catalogues (compared to beta = b/log(10) = 1), but also that biases in kappa and beta may occur when rounding errors are present below completeness. The GFMD-ALMM, by modelling different FMD shapes in an autonomous manner, opens the door to new statistical analyses in the realm of incomplete seismicity data, which could in theory improve earthquake forecasting by considering c. ten times more events.
Multimodal Deep Gaussian Processes
Kaiser, Markus, Otte, Clemens, Runkler, Thomas, Ek, Carl Henrik
We propose a novel Bayesian approach to modelling multimodal data generated by multiple independent processes, simultaneously solving the data association and induced supervised learning problems. Underpinning our approach is the use of Gaussian process priors which encode structure both on the functions and the associations themselves. The association of samples and functions are determined by taking both inputs and outputs into account while also obtaining a posterior belief about the relevance of the global components throughout the input space. We present an efficient learning scheme based on doubly stochastic variational inference and discuss how it can be applied to deep Gaussian process priors. We show results for an artificial data set, a noise separation problem, and a multimodal regression problem based on the cart-pole benchmark.
Metropolis-Hastings view on variational inference and adversarial training
Neklyudov, Kirill, Shvechikov, Pavel, Vetrov, Dmitry
In this paper we propose to view the acceptance rate of the Metropolis-Hastings algorithm as a universal objective for learning to sample from target distribution - given either as a set of samples or in the form of unnormalized density. To reveal the connection we derive the lower bound on the acceptance rate and treat it as the objective for learning explicit and implicit samplers. The form of the lower bound allows for doubly stochastic gradient optimization in case the target distribution factorizes (i.e. over data points). Bayesian framework and deep learning have become more and more interrelated during recent years. Recently Bayesian deep neural networks were used for estimating uncertainty (Gal & Ghahramani, 2016), ensembling (Gal & Ghahramani, 2016) and model compression (Molchanov et al., 2017). On the other hand, deep neural networks may be used to improve approximate inference in Bayesian models (Kingma & Welling, 2014). Learning modern Bayesian neural networks requires inference in the spaces with dimension up to several million by conditioning the weights of DNN on hundreds of thousands of objects.