Bayesian Learning
Estimation of the Learning Coefficient Using Empirical Loss
Takio, Tatsuyoshi, Suzuki, Joe
The learning coefficient plays a crucial role in analyzing the performance of information criteria, such as the Widely Applicable Information Criterion (WAIC) and the Widely Applicable Bayesian Information Criterion (WBIC), which Sumio Watanabe developed to assess model generalization ability. In regular statistical models, the learning coefficient is given by d/2, where d is the dimension of the parameter space. More generally, it is defined as the absolute value of the pole order of a zeta function derived from the Kullback-Leibler divergence and the prior distribution. However, except for specific cases such as reduced-rank regression, the learning coefficient cannot be derived in a closed form. Watanabe proposed a numerical method to estimate the learning coefficient, which Imai further refined to enhance its convergence properties. These methods utilize the asymptotic behavior of WBIC and have been shown to be statistically consistent as the sample size grows. In this paper, we propose a novel numerical estimation method that fundamentally differs from previous approaches and leverages a new quantity, "Empirical Loss," which was introduced by Watanabe. Through numerical experiments, we demonstrate that our proposed method exhibits both lower bias and lower variance compared to those of Watanabe and Imai. Additionally, we provide a theoretical analysis that elucidates why our method outperforms existing techniques and present empirical evidence that supports our findings.
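For orientation, the quantities named in this abstract can be written out as follows. This is a sketch of the standard definitions from Watanabe's singular learning theory, not a statement of the method proposed in the paper; the symbols K, \varphi, L_n, w_0, \beta_1 and \beta_2 are notation assumed here rather than taken from the abstract, and the two-temperature estimator in the last line is the commonly cited WBIC-based form, given only as an illustration of the kind of estimator the abstract refers to.

```latex
% Zeta function built from the KL divergence K(w) and the prior \varphi(w);
% its largest pole is z = -\lambda, where \lambda is the learning coefficient.
\zeta(z) = \int K(w)^{z}\,\varphi(w)\,dw

% Regular models with a d-dimensional parameter space:
\lambda = \frac{d}{2}

% Asymptotic behaviour of WBIC, the posterior mean of nL_n(w)
% at inverse temperature \beta = 1/\log n:
\mathrm{WBIC} = \mathbb{E}_{w}^{\beta}\!\left[nL_n(w)\right]
             = nL_n(w_0) + \lambda \log n + O_p\!\left(\sqrt{\log n}\right)

% A WBIC-style estimator using two inverse temperatures \beta_1 \neq \beta_2:
\hat{\lambda} =
  \frac{\mathbb{E}_{w}^{\beta_1}\!\left[nL_n(w)\right]
        - \mathbb{E}_{w}^{\beta_2}\!\left[nL_n(w)\right]}
       {1/\beta_1 - 1/\beta_2}
```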
Expert-Agnostic Learning to Defer
Strong, Joshua, Saha, Pramit, Ibrahim, Yasin, Ouyang, Cheng, Noble, Alison
Recent advancements in this field have introduced features enabling flexibility to unseen experts at test-time, but we find these approaches have significant limitations. To address these, we introduce EA-L2D: Expert-Agnostic Learning to Defer, a novel L2D framework that leverages a Bayesian approach to model expert behaviour in an expert-agnostic manner, facilitating optimal deferral decisions. EA-L2D offers several critical improvements over prior methods, including the ability to incorporate prior knowledge about experts, a reduced reliance on expert-annotated data, and robust performance when deferring to experts with expertise not seen during training.
[...] including the development of consistent surrogate losses for training these systems (Mozannar & Sontag, 2021; Verma & Nalisnick, 2022), and extensions that allow for deferral to multiple experts (Verma et al., 2023). Recent work by Tailor et al. (2024) proposed a meta-learning solution for L2D systems that can adapt to experts not seen during the training regime through meta-learning representations of expert behaviours, enabling the system to quickly adapt to new experts using a small set of their example predictions, denoted context predictions. However, this approach exhibits a key weakness in limited generalisation to experts with expertise unseen during training. Additionally, their solution poses problems seen more widely in L2D literature, [...]
Benchmarking the rationality of AI decision making using the transitivity axiom
Song, Kiwon, Jennings, James M. III, Davis-Stober, Clintin P.
Fundamental choice axioms, such as transitivity of preference, provide testable conditions for determining whether human decision making is rational, i.e., consistent with a utility representation. Recent work has demonstrated that AI systems trained on human data can exhibit reasoning biases similar to those of humans and that AI can, in turn, bias human judgments through AI recommendation systems. We evaluate the rationality of AI responses via a series of choice experiments designed to evaluate transitivity of preference in humans. We considered ten versions of Meta's Llama 2 and 3 LLMs. We applied Bayesian model selection to evaluate whether these AI-generated choices violated two prominent models of transitivity. We found that the Llama 2 and 3 models generally satisfied transitivity, but when violations did occur, they occurred only in the Chat/Instruct versions of the LLMs. We argue that rationality axioms, such as transitivity of preference, can be useful for evaluating and benchmarking the quality of AI-generated responses and provide a foundation for understanding computational rationality in AI systems more generally.
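To make the notion of a transitivity check concrete, the sketch below tests weak stochastic transitivity on pairwise choice proportions: if P(a over b) >= 0.5 and P(b over c) >= 0.5, then P(a over c) should also be >= 0.5. This is an assumption chosen for illustration; the abstract does not name the two transitivity models it evaluates, and the paper uses Bayesian model selection rather than a direct threshold check. The function names and the choice proportions are hypothetical.

```python
from itertools import permutations


def complete(proportions):
    """Add the reverse pairs, using P(b over a) = 1 - P(a over b)."""
    full = dict(proportions)
    for (a, b), value in proportions.items():
        full[(b, a)] = 1.0 - value
    return full


def wst_violations(p, items):
    """Return ordered triples (a, b, c) that violate weak stochastic transitivity.

    A triple violates it when a is chosen over b and b over c at least half
    the time, yet a is chosen over c less than half the time.
    """
    violations = []
    for a, b, c in permutations(items, 3):
        if p[(a, b)] >= 0.5 and p[(b, c)] >= 0.5 and p[(a, c)] < 0.5:
            violations.append((a, b, c))
    return violations


# Hypothetical choice proportions over three options (not data from the paper).
items = ["A", "B", "C"]
transitive = complete({("A", "B"): 0.7, ("B", "C"): 0.6, ("A", "C"): 0.8})
cyclic = complete({("A", "B"): 0.7, ("B", "C"): 0.6, ("A", "C"): 0.3})

print(wst_violations(transitive, items))
print(wst_violations(cyclic, items))
```

In the cyclic case the same preference cycle is reported once per starting point, so a single cycle over three options yields three violating triples.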
Large Language Models for Causal Discovery: Current Landscape and Future Directions
Wan, Guangya, Lu, Yunsheng, Wu, Yuqi, Hu, Mengxuan, Li, Sheng
Causal discovery (CD) and Large Language Models (LLMs) have emerged as transformative fields in artificial intelligence that have evolved largely independently. While CD specializes in uncovering cause-effect relationships from data, and LLMs excel at natural language processing and generation, their integration presents unique opportunities for advancing causal understanding. This survey examines how LLMs are transforming CD across three key dimensions: direct causal extraction from text, integration of domain knowledge into statistical methods, and refinement of causal structures. We systematically analyze approaches that leverage LLMs for CD tasks, highlighting their innovative use of metadata and natural language for causal inference. Our analysis reveals both LLMs' potential to enhance traditional CD methods and their current limitations as imperfect expert systems. We identify key research gaps, outline evaluation frameworks and benchmarks for LLM-based causal discovery, and advocate future research efforts for leveraging LLMs in causality research. As the first comprehensive examination of the synergy between LLMs and CD, this work lays the groundwork for future advances in the field.
A Unified Evaluation Framework for Epistemic Predictions
Manchingal, Shireen Kudukkil, Mubashar, Muhammad, Wang, Kaizheng, Cuzzolin, Fabio
[...] diverse, ranging from single point estimates (often averaged over prediction samples) to predictive distributions, to set-valued or credal-set representations. We propose a novel unified evaluation framework for uncertainty-aware classifiers, applicable to a wide range of model classes, which allows users to tailor the trade-off between accuracy and precision of predictions via a suitably designed performance [...]
[...] the available training set, N being the number of training instances. In Bayesian Neural Networks (BNNs) (Buntine and Weigend, 1991; Neal, 2012; Jospin et al., 2022; Kingma and Welling, 2013), this uncertainty is explicitly represented through posterior predictive distributions over the parameter space. In Deep Ensembles (DEs) (Lakshminarayanan et al., 2017), a predictive distribution is formed by aggregating the individual predictions generated by multiple independently trained models.
Explain Yourself, Briefly! Self-Explaining Neural Networks with Concise Sufficient Reasons
Bassan, Shahaf, Eliav, Ron, Gur, Shlomit
*Minimal sufficient reasons* represent a prevalent form of explanation - the smallest subset of input features which, when held constant at their corresponding values, ensure that the prediction remains unchanged. Previous *post-hoc* methods attempt to obtain such explanations but face two main limitations: (1) Obtaining these subsets poses a computational challenge, leading most scalable methods to converge towards suboptimal, less meaningful subsets; (2) These methods heavily rely on sampling out-of-distribution input assignments, potentially resulting in counterintuitive behaviors. To tackle these limitations, we propose in this work a self-supervised training approach, which we term *sufficient subset training* (SST). Using SST, we train models to generate concise sufficient reasons for their predictions as an integral part of their output. Our results indicate that our framework produces succinct and faithful subsets substantially more efficiently than competing post-hoc methods, while maintaining comparable predictive performance.
Do Large Language Models Reason Causally Like Us? Even Better?
Dettki, Hanna M., Lake, Brenden M., Wu, Charley M., Rehder, Bob
Causal reasoning is a core component of intelligence. Large language models (LLMs) have shown impressive capabilities in generating human-like text, raising questions about whether their responses reflect true understanding or statistical patterns. We compared causal reasoning in humans and four LLMs using tasks based on collider graphs, rating the likelihood of a query variable occurring given evidence from other variables. We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment shifting based on model, context, and task. Overall, GPT-4o and Claude showed the most normative behavior, including "explaining away", whereas Gemini-Pro and GPT-3.5 did not. Although all agents deviated from the expected independence of causes - Claude the least - they exhibited strong associative reasoning and predictive inference when assessing the likelihood of the effect given its causes. These findings underscore the need to assess AI biases as they increasingly assist human decision-making.
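The "explaining away" pattern mentioned above can be illustrated numerically on a collider C1 -> E <- C2. The sketch below assumes a noisy-OR likelihood with equal priors and causal strengths; this parameterisation and all numbers are hypothetical and are not taken from the study. Under these assumptions, learning that the alternative cause C2 is present should lower the posterior probability of C1 given the effect.

```python
from itertools import product


def noisy_or(c1, c2, strength=0.8):
    """P(E = 1 | c1, c2) under a noisy-OR with no background cause."""
    return 1.0 - (1.0 - strength) ** (c1 + c2)


def posterior_c1(prior=0.5, strength=0.8, c2_observed=None):
    """P(C1 = 1 | E = 1), optionally also conditioning on an observed C2."""
    numerator = 0.0
    denominator = 0.0
    for c1, c2 in product((0, 1), repeat=2):
        if c2_observed is not None and c2 != c2_observed:
            continue
        p_c1 = prior if c1 else 1.0 - prior
        p_c2 = prior if c2 else 1.0 - prior
        joint = p_c1 * p_c2 * noisy_or(c1, c2, strength)
        denominator += joint
        if c1:
            numerator += joint
    return numerator / denominator


effect_only = posterior_c1()                      # P(C1 | E)
with_other_cause = posterior_c1(c2_observed=1)    # P(C1 | E, C2 = 1)
print(effect_only, with_other_cause)
```

With these assumed values the posterior drops from about 0.69 to about 0.55 once C2 is known to be present, which is the normative explaining-away direction.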
Fenchel-Young Variational Learning
Sklaviadis, Sophia, Agrawal, Sweta, Farinhas, Antonio, Martins, Andre, Figueiredo, Mario
From a variational perspective, many statistical learning criteria involve seeking a distribution that balances empirical risk and regularization. In this paper, we broaden this perspective by introducing a new general class of variational methods based on Fenchel-Young (FY) losses, treated as divergences that generalize (and encompass) the familiar Kullback-Leibler divergence at the core of classical variational learning. Our proposed formulation -- FY variational learning -- includes as key ingredients new notions of FY free energy, FY evidence, FY evidence lower bound, and FY posterior. We derive alternating minimization and gradient backpropagation algorithms to compute (or lower bound) the FY evidence, which enables learning a wider class of models than previous variational formulations. This leads to generalized FY variants of classical algorithms, such as an FY expectation-maximization (FYEM) algorithm, and latent-variable models, such as an FY variational autoencoder (FYVAE). Our new methods are shown to be empirically competitive, often outperforming their classical counterparts, and most importantly, to have qualitatively novel features. For example, FYEM has an adaptively sparse E-step, while the FYVAE can support models with sparse observations and sparse posteriors.
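As a small check of the claim that Fenchel-Young losses encompass the Kullback-Leibler divergence, the sketch below uses the standard definition L(theta; y) = Omega*(theta) + Omega(y) - <theta, y> with Omega taken to be the negative Shannon entropy on the probability simplex, whose conjugate is log-sum-exp. In that case the FY loss equals KL(y || softmax(theta)). The function names and the example vectors are assumptions for illustration; the code does not implement the FY free energy, FYEM, or FYVAE from the paper.

```python
import math


def softmax(theta):
    """Numerically stable softmax of a list of scores."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def logsumexp(theta):
    """Conjugate of the negative Shannon entropy restricted to the simplex."""
    m = max(theta)
    return m + math.log(sum(math.exp(t - m) for t in theta))


def neg_entropy(y):
    """Omega(y) = sum_i y_i log y_i, with 0 log 0 treated as 0."""
    return sum(p * math.log(p) for p in y if p > 0)


def fy_loss(theta, y):
    """Fenchel-Young loss generated by the negative Shannon entropy."""
    inner = sum(t * p for t, p in zip(theta, y))
    return logsumexp(theta) + neg_entropy(y) - inner


def kl(y, q):
    """KL(y || q) for discrete distributions."""
    return sum(p * math.log(p / qi) for p, qi in zip(y, q) if p > 0)


theta = [1.0, -0.5, 2.0]   # hypothetical scores
y = [0.2, 0.3, 0.5]        # hypothetical target distribution
print(fy_loss(theta, y), kl(y, softmax(theta)))
```

Other choices of Omega give other losses in the same family, which is the sense in which the KL case is one instance of a broader class.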
Revisiting the Berkeley Admissions data: Statistical Tests for Causal Hypotheses
Bhadane, Sourbh, Mooij, Joris M., Boeken, Philip, Zoeter, Onno
Reasoning about fairness through correlation-based notions is rife with pitfalls. The 1973 University of California, Berkeley graduate school admissions case from Bickel et al. (1975) is a classic example of one such pitfall, namely Simpson's paradox. The discrepancy in admission rates between male and female applicants, in the aggregate data over all departments, vanishes when admission rates per department are examined. We reason about the Berkeley graduate school admissions case through a causal lens. In the process, we introduce a statistical test for causal hypothesis testing based on Pearl's instrumental-variable inequalities (Pearl 1995). We compare different causal notions of fairness that are based on graphical, counterfactual and interventional queries on the causal model, and develop statistical tests for these notions that use only observational data. We study the logical relations between notions, and show that while notions may not be equivalent, their corresponding statistical tests coincide for the case at hand. We believe that a thorough case-based causal analysis helps develop a more principled understanding of both causal hypothesis testing and fairness.
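Simpson's paradox, the pitfall named above, can be reproduced with a few lines of arithmetic. The counts below are hypothetical and chosen only to make the effect visible; they are not the Berkeley admissions data, and the department names are placeholders. In this toy table one group has the higher admission rate in every department but the lower rate in the aggregate, because it applies mostly to the more selective department.

```python
# Hypothetical (admitted, applied) counts per department and group.
counts = {
    "dept_X": {"male": (80, 100), "female": (18, 20)},
    "dept_Y": {"male": (4, 20), "female": (30, 100)},
}


def per_department_rates(table):
    """Admission rate for each group within each department."""
    return {
        dept: {group: adm / app for group, (adm, app) in groups.items()}
        for dept, groups in table.items()
    }


def aggregate_rates(table):
    """Admission rate for each group after pooling all departments."""
    totals = {}
    for groups in table.values():
        for group, (adm, app) in groups.items():
            prev_adm, prev_app = totals.get(group, (0, 0))
            totals[group] = (prev_adm + adm, prev_app + app)
    return {group: adm / app for group, (adm, app) in totals.items()}


by_dept = per_department_rates(counts)
pooled = aggregate_rates(counts)
print(by_dept)
print(pooled)
```

The reversal arises purely from how applicants are distributed across departments, which is why correlation-based comparisons of the pooled rates can mislead.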
A Latent Causal Inference Framework for Ordinal Variables
Scauda, Martina, Kuipers, Jack, Moffa, Giusi
Ordinal variables, such as those measured on a Likert scale, are common in applied research. Yet, existing methods for causal inference tend to target nominal or continuous data. When applied to ordinal data, such methods either fail to account for the inherent ordering or impose well-defined relative magnitudes. Hence, there is a need for specialised methods to compute interventional effects between ordinal variables while accounting for their ordinality. One potential framework is to presume a latent Gaussian Directed Acyclic Graph (DAG) model: that the ordinal variables originate from marginally discretizing a set of Gaussian variables whose latent covariance matrix is constrained to satisfy the conditional independencies inherent in a DAG. Conditioned on a given latent covariance matrix and discretisation thresholds, we derive a closed-form function for ordinal causal effects in terms of interventional distributions in the latent space. Our causal estimation combines naturally with algorithms to learn the latent DAG and its parameters, like the Ordinal Structural EM algorithm. Simulations demonstrate the applicability of the proposed approach in estimating ordinal causal effects both for known and unknown structures of the latent graph. As an illustration of a real-world use case, the method is applied to survey data of 408 patients from a study on the functional relationships between symptoms of obsessive-compulsive disorder and depression.
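The discretisation step described above can be sketched for a single variable: given thresholds, the probability of each ordinal level is the Gaussian mass between consecutive thresholds. The function name, thresholds, and latent means below are hypothetical, and the example covers only the marginal discretisation of one latent variable; it is not the paper's closed-form expression for ordinal causal effects, which also involves the latent covariance structure.

```python
from statistics import NormalDist


def ordinal_probabilities(thresholds, mean=0.0, sd=1.0):
    """Probabilities of the ordinal levels obtained by cutting a Gaussian.

    `thresholds` must be sorted in increasing order; K thresholds give K + 1 levels.
    """
    latent = NormalDist(mean, sd)
    cdf_values = [0.0] + [latent.cdf(t) for t in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]


thresholds = [-1.0, 0.0, 1.0]                          # hypothetical cut points
baseline = ordinal_probabilities(thresholds)            # latent mean 0
shifted = ordinal_probabilities(thresholds, mean=0.5)   # latent mean moved upward
print(baseline)
print(shifted)
```

Comparing the two lists shows how a change in the latent distribution moves probability mass across ordinal levels without assigning numeric distances between the levels themselves.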