This paper investigates the asymptotic distribution of the K-fold cross validation error in an i.i.d. setting. As the number of observations n goes to infinity while keeping the number of folds K fixed, the K-fold cross validation error is √ n-consistent for the expected out-of-sample error and has an asymptotically normal distribution. A consistent estimate of the asymptotic variance is derived and used to construct asymptotically valid confidence intervals for the expected out-of-sample error. A hypothesis test is developed for comparing two estimators’ expected out-of-sample errors and a subsampling procedure is used to obtain critical values. Monte Carlo simulations demonstrate the asymptotic validity of our confidence intervals for the expected out-of-sample error and investigate the size and power properties of our test. In our empirical application, we use our estimator selection test to compare the out-of-sample predictive performance of OLS, Neural Networks, and Random Forests for predicting the sale price of a domain name in a GoDaddy expiry auction.
The core principle of Variational Inference (VI) is to convert the statistical inference problem of computing complex posterior probability densities into a tractable optimization problem. This property enables VI to be faster than several sampling-based techniques. However, the traditional VI algorithm is not scalable to large data sets and is unable to readily infer out-of-bounds data points without re-running the optimization process. Recent developments in the field, like stochastic-, black box-, and amortized-VI, have helped address these issues. Generative modeling tasks nowadays widely make use of amortized VI for its efficiency and scalability, as it utilizes a parameterized function to learn the approximate posterior density parameters. In this paper, we review the mathematical foundations of various VI techniques to form the basis for understanding amortized VI. Additionally, we provide an overview of the recent trends that address several issues of amortized VI, such as the amortization gap, generalization issues, inconsistent representation learning, and posterior collapse. Finally, we analyze alternate divergence measures that improve VI optimization.
Diffusion models have recently emerged as the de facto standard for generating complex, high-dimensional outputs. You may know them for their ability to produce stunning AI art and hyper-realistic synthetic images, but they have also found success in other applications such as drug design and continuous control. The key idea behind diffusion models is to iteratively transform random noise into a sample, such as an image or protein structure. This is typically motivated as a maximum likelihood estimation problem, where the model is trained to generate samples that match the training data as closely as possible. However, most use cases of diffusion models are not directly concerned with matching the training data, but instead with a downstream objective.
This paper presents alternative techniques for inference on classical Bayesian networks in which all probabilities are fixed, and for synthesis problems when conditional probability tables (CPTs) in such networks contain symbolic parameters rather than concrete probabilities. The key idea is to exploit probabilistic model checking as well as its recent extension to parameter synthesis techniques for parametric Markov chains. To enable this, the Bayesian networks are transformed into Markov chains and their objectives are mapped onto probabilistic temporal logic formulas. For exact inference, we compare probabilistic model checking to weighted model counting on various Bayesian network benchmarks. We contrast symbolic model checking using multi-terminal binary (aka: algebraic) decision diagrams to symbolic inference using proba- bilistic sentential decision diagrams, symbolic data structures that are tailored to Bayesian networks. For the parametric setting, we describe how our techniques can be used for various synthesis problems such as computing sensitivity functions (and values), simple and difference parameter tuning and ratio parameter tuning. Our parameter synthesis techniques are applicable to arbitrarily many, possibly dependent, parameters that may occur in multiple CPTs. This lifts restrictions, e.g., on the number of parametrized CPTs, or on parameter dependencies between several CPTs, that exist in the literature. Experiments on several benchmarks show that our parameter synthesis techniques can treat parameter synthesis for Bayesian networks (with hundreds of unknown parameters) that are out of reach for existing techniques.
Statistical relational AI and probabilistic logic programming have so far mostly focused on discrete probabilistic models. The reasons for this is that one needs to provide constructs to succinctly model the independencies in such models, and also provide efficient inference. Three types of independencies are important to represent and exploit for scalable inference in hybrid models: conditional independencies elegantly modeled in Bayesian networks, context-specific independencies naturally represented by logical rules, and independencies amongst attributes of related objects in relational models succinctly expressed by combining rules. This paper introduces a hybrid probabilistic logic programming language, DC#, which integrates distributional clauses' syntax and semantics principles of Bayesian logic programs. It represents the three types of independencies qualitatively. More importantly, we also introduce the scalable inference algorithm FO-CS-LW for DC#. FO-CS-LW is a first-order extension of the context-specific likelihood weighting algorithm (CS-LW), a novel sampling method that exploits conditional independencies and context-specific independencies in ground models. The FO-CS-LW algorithm upgrades CS-LW with unification and combining rules to the first-order case.
In today's rapidly advancing world of Artificial Intelligence (AI), the need for explainable AI has become more critical than ever. As AI systems are being increasingly integrated into various aspects of our daily lives, it is crucial to understand how these systems make decisions and provide explanations for their actions. Bayesian networks, a powerful and versatile graphical modeling technique, are gaining prominence as a tool for building explainable AI models. In this blog, we will demystify Bayesian networks and explore their relevance in the field of AI. We will delve into the fundamentals of Bayesian networks, their applications in AI, and how they enable explainable AI.
In simple perceptual decisions the brain has to identify a stimulus based on noisy sensory samples from the stimulus. Basic statistical considerations state that the reliability of the stimulus information, i.e., the amount of noise in the samples, should be taken into account when the decision is made. However, for perceptual decision making experiments it has been questioned whether the brain indeed uses the reliability for making decisions when confronted with unpredictable changes in stimulus reliability. We here show that even the basic drift diffusion model, which has frequently been used to explain experimental findings in perceptual decision making, implicitly relies on estimates of stimulus reliability. We then show that only those variants of the drift diffusion model which allow stimulusspecific reliabilities are consistent with neurophysiological findings. Our analysis suggests that the brain estimates the reliability of the stimulus on a short time scale of at most a few hundred milliseconds.
When the probability distribution P(X, Y) is known, the optimal classifier, leading to the minimum misclassification rate, is given by the Maximum A-posteriori Probability (MAP) decision rule. However, in practice, estimating the complete joint distribution P(X, Y) is computationally and statistically impossible for large values of d. Therefore, an alternative approach is to first estimate some low order marginals of the joint probability distribution P(X, Y) and then design the classifier based on the estimated low order marginals. This approach is also helpful when the complete training data instances are not available due to privacy concerns. In this work, we consider the problem of finding the optimum classifier based on some estimated low order marginals of (X, Y).
To infer a multilayer representation of high-dimensional count vectors, we propose the Poisson gamma belief network (PGBN) that factorizes each of its layers into the product of a connection weight matrix and the nonnegative real hidden units of the next layer. The PGBN's hidden layers are jointly trained with an upward-downward Gibbs sampler, each iteration of which upward samples Dirichlet distributed connection weight vectors starting from the first layer (bottom data layer), and then downward samples gamma distributed hidden units starting from the top hidden layer. The gamma-negative binomial process combined with a layer-wise training strategy allows the PGBN to infer the width of each layer given a fixed budget on the width of the first layer. The PGBN with a single hidden layer reduces to Poisson factor analysis. Example results on text analysis illustrate interesting relationships between the width of the first layer and the inferred network structure, and demonstrate that the PGBN, whose hidden units are imposed with correlated gamma priors, can add more layers to increase its performance gains over Poisson factor analysis, given the same limit on the width of the first layer.
Practitioners of Bayesian statistics have long depended on Markov chain Monte Carlo (MCMC) to obtain samples from intractable posterior distributions. Unfortunately, MCMC algorithms are typically serial, and do not scale to the large datasets typical of modern machine learning. The recently proposed consensus Monte Carlo algorithm removes this limitation by partitioning the data and drawing samples conditional on each partition in parallel . A fixed aggregation function then combines these samples, yielding approximate posterior samples. We introduce variational consensus Monte Carlo (VCMC), a variational Bayes algorithm that optimizes over aggregation functions to obtain samples from a distribution that better approximates the target. The resulting objective contains an intractable entropy term; we therefore derive a relaxation of the objective and show that the relaxed problem is blockwise concave under mild conditions. We illustrate the advantages of our algorithm on three inference tasks from the literature, demonstrating both the superior quality of the posterior approximation and the moderate overhead of the optimization step. Our algorithm achieves a relative error reduction (measured against serial MCMC) of up to 39% compared to consensus Monte Carlo on the task of estimating 300-dimensional probit regression parameter expectations; similarly, it achieves an error reduction of 92% on the task of estimating cluster comembership probabilities in a Gaussian mixture model with 8 components in 8 dimensions. Furthermore, these gains come at moderate cost compared to the runtime of serial MCMC--achieving near-ideal speedup in some instances.