Bayesian Inference
Algorithmic recourse under imperfect causal knowledge: a probabilistic approach
Karimi, Amir-Hossein, von Kügelgen, Julius, Schölkopf, Bernhard, Valera, Isabel
Recent work has discussed the limitations of counterfactual explanations to recommend actions for algorithmic recourse, and argued for the need of taking causal relationships between features into consideration. Unfortunately, in practice, the true underlying structural causal model is generally unknown. In this work, we first show that it is impossible to guarantee recourse without access to the true structural equations. To address this limitation, we propose two probabilistic approaches to select optimal actions that achieve recourse with high probability given limited causal knowledge (e.g., only the causal graph). The first captures uncertainty over structural equations under additive Gaussian noise, and uses Bayesian model averaging to estimate the counterfactual distribution. The second removes any assumptions on the structural equations by instead computing the average effect of recourse actions on individuals similar to the person who seeks recourse, leading to a novel subpopulation-based interventional notion of recourse. We then derive a gradient-based procedure for selecting optimal recourse actions, and empirically show that the proposed approaches lead to more reliable recommendations under imperfect causal knowledge than non-probabilistic baselines.
DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning
Ellis, Kevin, Wong, Catherine, Nye, Maxwell, Sable-Meyer, Mathias, Cary, Luc, Morales, Lucas, Hewitt, Luke, Solar-Lezama, Armando, Tenenbaum, Joshua B.
Expert problem-solving is driven by powerful languages for thinking about problems and their solutions. Acquiring expertise means learning these languages -- systems of concepts, alongside the skills to use them. We present DreamCoder, a system that learns to solve problems by writing programs. It builds expertise by creating programming languages for expressing domain concepts, together with neural networks to guide the search for programs within these languages. A "wake-sleep" learning algorithm alternately extends the language with new symbolic abstractions and trains the neural network on imagined and replayed problems. DreamCoder solves both classic inductive programming tasks and creative tasks such as drawing pictures and building scenes. It rediscovers the basics of modern functional programming, vector algebra and classical physics, including Newton's and Coulomb's laws. Concepts are built compositionally from those learned earlier, yielding multi-layered symbolic representations that are interpretable and transferable to new tasks, while still growing scalably and flexibly with experience.
p-d-Separation -- A Concept for Expressing Dependence/Independence Relations in Causal Networks
Spirtes, Glymour and Scheines formulated a Conjecture that a direct dependence test and a head-to-head meeting test would suffice to construe directed acyclic graph decompositions of a joint probability distribution (Bayesian network) for which Pearl's d-separation applies. This Conjecture was later shown to be a direct consequence of a result of Pearl and Verma. This paper is intended to prove this Conjecture in a new way, by exploiting the concept of p-d-separation (partial dependency separation). While Pearl's d-separation works with Bayesian networks, p-d-separation is intended to apply to causal networks: that is, partially oriented networks in which orientations are given only to those edges that express statistically confirmed causal influence, whereas undirected edges express the existence of direct influence without the possibility of determining the direction of causation. As a consequence of the particular way of proving the validity of this Conjecture, an algorithm for construction of all the directed acyclic graphs (dags) carrying the available independence information is also presented. The notion of a partially oriented graph (pog) is introduced and within this graph the notion of p-d-separation is defined. It is demonstrated that the p-d-separation within the pog is equivalent to d-separation in all derived dags.
Algebraic Ground Truth Inference: Non-Parametric Estimation of Sample Errors by AI Algorithms
Corrada-Emmanuel, Andrés, Pantridge, Edward, Zahrebelski, Edward, Chaganti, Aditya, Simeonov, Simeon
Binary classification is widely used in ML production systems. Monitoring classifiers in a constrained event space is well known. However, real world production systems often lack the ground truth these methods require. Privacy concerns may also require that the ground truth needed to evaluate the classifiers cannot be made available. In these autonomous settings, non-parametric estimators of performance are an attractive solution. They do not require theoretical models about how the classifiers made errors in any given sample. They just estimate how many errors there are in a sample of an industrial or robotic datastream. We construct one such non-parametric estimator of the sample errors for an ensemble of weak binary classifiers. Our approach uses algebraic geometry to reformulate the self-assessment problem for ensembles of binary classifiers as an exact polynomial system. The polynomial formulation can then be used to prove - as an algebraic geometry algorithm - that no general solution to the self-assessment problem is possible. However, specific solutions are possible in settings where the engineering context puts the classifiers close to independent errors. The practical utility of the method is illustrated on a real-world dataset from an online advertising campaign and a sample of common classification benchmarks. The accuracy estimators in the experiments where we have ground truth are better than one part in a hundred. The online advertising campaign data, where we do not have ground truth data, is verified by an internal consistency approach whose validity we conjecture as an algebraic geometry theorem. We call this approach - algebraic ground truth inference.
Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference
Zhang, Hao, Chen, Bo, Cong, Yulai, Guo, Dandan, Liu, Hongwei, Zhou, Mingyuan
To build a flexible and interpretable model for document analysis, we develop the deep autoencoding topic model (DATM), which uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network. In order to provide scalable posterior inference for the parameters of the generative network, we develop topic-layer-adaptive stochastic gradient Riemannian MCMC that jointly learns simplex-constrained global parameters across all layers and topics, with topic- and layer-specific learning rates. Given a posterior sample of the global parameters, in order to efficiently infer the local latent representations of a document under DATM across all stochastic layers, we propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a Weibull-distribution-based stochastic downward generative model. To jointly model documents and their associated labels, we further propose supervised DATM, which enhances the discriminative power of its latent representations. The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
Estimation of Skill Distributions
Jadbabaie, Ali, Makur, Anuran, Shah, Devavrat
In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of game outcomes are governed by the Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, when the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove minimax lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.
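As a small illustration of the Bradley-Terry-Luce likelihood described in this abstract, the following Python sketch computes the probability that one agent beats another as the ratio of its skill to the pairwise sum of skills, and simulates a round-robin tournament from it. It is not the authors' estimation algorithm; the function names, the round-robin schedule, and the fixed skill values are assumptions made only for illustration.

```python
import random


def btl_win_probability(skill_i, skill_j):
    """Probability that agent i beats agent j under the BTL model:
    skill_i / (skill_i + skill_j). Skills must be positive."""
    if skill_i <= 0 or skill_j <= 0:
        raise ValueError("skills must be positive")
    return skill_i / (skill_i + skill_j)


def simulate_tournament(skills, games_per_pair, rng):
    """Play games_per_pair games between every pair of agents
    (a hypothetical round-robin schedule, assumed for illustration).

    Returns a matrix wins where wins[i][j] counts how often i beat j.
    """
    n = len(skills)
    wins = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = btl_win_probability(skills[i], skills[j])
            for _ in range(games_per_pair):
                if rng.random() < p:
                    wins[i][j] += 1
                else:
                    wins[j][i] += 1
    return wins


# Hypothetical skill values; in the abstract's setting these would be
# drawn from the unknown skill density that one wants to learn.
example_skills = [1.0, 2.0, 3.0]
example_wins = simulate_tournament(example_skills, 5, random.Random(0))
```

Only the win counts are observed in such a tournament, which is why the abstract describes the task as learning a distribution from noisy, quantized observations.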
Root Cause Analysis in Lithium-Ion Battery Production with FMEA-Based Large-Scale Bayesian Network
Kirchhof, Michael, Haas, Klaus, Kornas, Thomas, Thiede, Sebastian, Hirz, Mario, Herrmann, Christoph
The production of lithium-ion battery cells is characterized by a high degree of complexity due to numerous cause-effect relationships between process characteristics. Knowledge about the multi-stage production is spread among several experts, rendering tasks such as failure analysis challenging. In this paper, a new method is presented that includes expert knowledge acquisition in production ramp-up by combining Failure Mode and Effects Analysis (FMEA) with a Bayesian Network. Special algorithms are presented that help detect and resolve inconsistencies between the expert-provided parameters, which are bound to occur when collecting knowledge from several process experts. We show the effectiveness of this holistic method by building up a large-scale, cross-process Bayesian Failure Network in lithium-ion battery production and applying it to root cause analysis.
Multi-Purchase Behavior: Modeling and Optimization
Tulabandhula, Theja, Sinha, Deeksha, Patidar, Prasoon
We study the problem of modeling the purchase of multiple items and utilizing it to display optimized recommendations, which is a central problem for online e-commerce platforms. Rich personalized modeling of users and fast computation of optimal products to display given these models can lead to significantly higher revenues and simultaneously enhance the end user experience. We present a parsimonious multi-purchase family of choice models called the BundleMVL-K family, and develop a binary search based iterative strategy that efficiently computes optimized recommendations for this model. This is one of the first attempts at operationalizing the multi-purchase class of choice models. We characterize structural properties of the optimal solution, which allow one to decide if a product is part of the optimal assortment in constant time, reducing the size of the instance that needs to be solved computationally. We also establish the hardness of computing optimal recommendation sets. We show one of the first quantitative links between modeling multiple purchase behavior and revenue gains. The efficacy of our modeling and optimization techniques compared to competing solutions is shown using several real world datasets on multiple metrics such as model fitness, expected revenue gains and run-time reductions. The benefit of taking multiple purchases into account is observed to be $6-8\%$ in relative terms for the Ta Feng and UCI shopping datasets when compared to the MNL model for instances with $\sim 1500$ products. Additionally, across $8$ real world datasets, the test log-likelihood fits of our models are on average $17\%$ better in relative terms. The simplicity of our models and the iterative nature of our optimization technique allow practitioners to meet stringent computational constraints while increasing their revenues in practical recommendation applications at scale.
Estimation of dense stochastic block models visited by random walks
Tran, Viet Chi, Vo, Thi Phuong Thuy
We are interested in recovering information on a stochastic block model from the subgraph discovered by an exploring random walk. Stochastic block models correspond to populations structured into a finite number of types, where two individuals are connected by an edge independently from the other pairs and with a probability depending on their types. We consider here the dense case where the random network can be approximated by a graphon. This problem is motivated by the study of chain-referral surveys where each interviewee provides information on her/his contacts in the social network. First, we write the likelihood of the subgraph discovered by the random walk: biases appear since hubs and majority types are more likely to be sampled. Even in the case where the types are observed, the maximum likelihood estimator is no longer explicit. When the types of the vertices are unobserved, we use an SAEM algorithm to maximize the likelihood. Second, we propose a different estimation strategy using new results by Athreya and Roellin. It consists in de-biasing the maximum likelihood estimator proposed in Daudin et al., which ignores the biases.
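To make the sampling setup in this abstract concrete, the following Python sketch draws a graph from a stochastic block model, where each pair of vertices is connected independently with a probability depending on their types, and then runs a simple random walk on it. It does not implement the likelihood, the SAEM algorithm, or the de-biasing step. The function names are hypothetical, and treating the "discovered" subgraph as the set of visited vertices is an assumption, since the abstract does not specify what the walk reveals.

```python
import random


def sample_sbm(types, connect_prob, rng):
    """Sample an undirected stochastic block model graph.

    types[v] is the type (block index) of vertex v, and
    connect_prob[a][b] is the edge probability between types a and b.
    Returns an adjacency dict mapping each vertex to a set of neighbors.
    """
    n = len(types)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            # Each pair is connected independently of all other pairs.
            if rng.random() < connect_prob[types[u]][types[v]]:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def random_walk(adj, start, steps, rng):
    """Run a simple random walk for at most `steps` moves.

    Returns the list of visited vertices in order; the walk stops
    early if it reaches a vertex with no neighbors.
    """
    path = [start]
    current = start
    for _ in range(steps):
        neighbors = sorted(adj[current])
        if not neighbors:
            break
        current = rng.choice(neighbors)
        path.append(current)
    return path


# Hypothetical two-type example: dense within types, sparser across.
example_types = [0] * 6 + [1] * 4
example_probs = [[0.8, 0.2], [0.2, 0.6]]
example_rng = random.Random(1)
example_graph = sample_sbm(example_types, example_probs, example_rng)
example_path = random_walk(example_graph, 0, 20, example_rng)
```

Because the walk moves along edges, high-degree vertices are visited more often than under uniform sampling, which is the kind of bias the abstract says the likelihood must account for.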
Model Linkage Selection for Cooperative Learning
Zhou, Jiaying, Ding, Jie, Tan, Kean Ming, Tarokh, Vahid
Rapid developments in data collecting devices and computation platforms produce a growing number of learners and data modalities in many scientific domains. We consider the setting in which each learner holds a pair consisting of a parametric statistical model and a specific data source, with the goal of integrating information across a set of learners to enhance the prediction accuracy of a specific learner. One natural way to integrate information is to build a joint model across a set of learners that shares common parameters of interest. However, the parameter sharing patterns across a set of learners are not known a priori. Misspecifying the parameter sharing patterns and the parametric statistical model for each learner yields a biased estimator and degrades the prediction accuracy of the joint model. In this paper, we propose a novel framework for integrating information across a set of learners that is robust against model misspecification and misspecified parameter sharing patterns. The main idea is to sequentially incorporate additional learners that can enhance the prediction accuracy of an existing joint model based on user-specified parameter sharing patterns across a set of learners, starting from a model with one learner. Theoretically, we show that the proposed method can data-adaptively select the correct parameter sharing patterns based on the user-specified parameter sharing patterns, and thus enhance the prediction accuracy of a learner. Extensive numerical studies are performed to evaluate the performance of the proposed method.