Bayesian Learning
Scoring rule nets: beyond mean target prediction in multivariate regression
Probabilistic regression models trained with maximum likelihood estimation (MLE), can sometimes overestimate variance to an unacceptable degree. This is mostly problematic in the multivariate domain. While univariate models often optimize the popular Continuous Ranked Probability Score (CRPS), in the multivariate domain, no such alternative to MLE has yet been widely accepted. The Energy Score - the most investigated alternative - notoriously lacks closed-form expressions and sensitivity to the correlation between target variables. In this paper, we propose Conditional CRPS: a multivariate strictly proper scoring rule that extends CRPS. We show that closed-form expressions exist for popular distributions and illustrate their sensitivity to correlation. We then show in a variety of experiments on both synthetic and real data, that Conditional CRPS often outperforms MLE, and produces results comparable to state-of-the-art non-parametric models, such as Distributional Random Forest (DRF).
Challenging the Performance-Interpretability Trade-off: An Evaluation of Interpretable Machine Learning Models
Kruschel, Sven, Hambauer, Nico, Weinzierl, Sven, Zilker, Sandra, Kraus, Mathias, Zschech, Patrick
Machine learning is permeating every conceivable domain to promote data-driven decision support. The focus is often on advanced black-box models due to their assumed performance advantages, whereas interpretable models are often associated with inferior predictive qualities. More recently, however, a new generation of generalized additive models (GAMs) has been proposed that offer promising properties for capturing complex, non-linear patterns while remaining fully interpretable. To uncover the merits and limitations of these models, this study examines the predictive performance of seven different GAMs in comparison to seven commonly used machine learning models based on a collection of twenty tabular benchmark datasets. To ensure a fair and robust model comparison, an extensive hyperparameter search combined with cross-validation was performed, resulting in 68,500 model runs. In addition, this study qualitatively examines the visual output of the models to assess their level of interpretability. Based on these results, the paper dispels the misconception that only black-box models can achieve high accuracy by demonstrating that there is no strict trade-off between predictive performance and model interpretability for tabular data. Furthermore, the paper discusses the importance of GAMs as powerful interpretable models for the field of information systems and derives implications for future work from a socio-technical perspective.
Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies
Jeong, Hyunchai, Ejaz, Adiba, Tian, Jin, Bareinboim, Elias
Testing a hypothesized causal model against observational data is a key prerequisite for many causal inference tasks. A natural approach is to test whether the conditional independence relations (CIs) assumed in the model hold in the data. While a model can assume exponentially many CIs (with respect to the number of variables), testing all of them is both impractical and unnecessary. Causal graphs, which encode these CIs in polynomial space, give rise to local Markov properties that enable model testing with a significantly smaller subset of CIs. Model testing based on local properties requires an algorithm to list the relevant CIs. However, existing algorithms for realistic settings with hidden variables and non-parametric distributions can take exponential time to produce even a single CI constraint. In this paper, we introduce the c-component local Markov property (C-LMP) for causal graphs with hidden variables. Since C-LMP can still invoke an exponential number of CIs, we develop a polynomial delay algorithm to list these CIs in poly-time intervals. To our knowledge, this is the first algorithm that enables poly-delay testing of CIs in causal graphs with hidden variables against arbitrary data distributions. Experiments on real-world and synthetic data demonstrate the practicality of our algorithm.
Predicting Coronary Heart Disease Using a Suite of Machine Learning Models
Al-Karaki, Jamal, Ilono, Philip, Baweja, Sanchit, Naghiyev, Jalal, Yadav, Raja Singh, Khan, Muhammad Al-Zafar
Coronary Heart Disease affects millions of people worldwide and is a well-studied area of healthcare. There are many viable and accurate methods for the diagnosis and prediction of heart disease, but they have limiting points such as invasiveness, late detection, or cost. Supervised learning via machine learning algorithms presents a low-cost (computationally speaking), non-invasive solution that can be a precursor for early diagnosis. In this study, we applied several well-known methods and benchmarked their performance against each other. It was found that Random Forest with oversampling of the predictor variable produced the highest accuracy of 84%.
A Ring-Based Distributed Algorithm for Learning High-Dimensional Bayesian Networks
Laborda, Jorge D., Torrijos, Pablo, Puerta, Josรฉ M., Gรกmez, Josรฉ A.
Learning Bayesian Networks (BNs) from high-dimensional data is a complex and time-consuming task. Although there are approaches based on horizontal (instances) or vertical (variables) partitioning in the literature, none can guarantee the same theoretical properties as the Greedy Equivalence Search (GES) algorithm, except those based on the GES algorithm itself. In this paper, we propose a directed ring-based distributed method that uses GES as the local learning algorithm, ensuring the same theoretical properties as GES but requiring less CPU time. The method involves partitioning the set of possible edges and constraining each processor in the ring to work only with its received subset. The global learning process is an iterative algorithm that carries out several rounds until a convergence criterion is met. In each round, each processor receives a BN from its predecessor in the ring, fuses it with its own BN model, and uses the result as the starting solution for a local learning process constrained to its set of edges. Subsequently, it sends the model obtained to its successor in the ring. Experiments were carried out on three large domains (400-1000 variables), demonstrating our proposal's effectiveness compared to GES and its fast version (fGES).
Efficient Identification of Direct Causal Parents via Invariance and Minimum Error Testing
Nguyen, Minh, Sabuncu, Mert R.
Invariant causal prediction (ICP) is a popular technique for finding causal parents (direct causes) of a target via exploiting distribution shifts and invariance testing (Peters et al., 2016). However, since ICP needs to run an exponential number of tests and fails to identify parents when distribution shifts only affect a few variables, applying ICP to practical large scale problems is challenging. We propose MMSE-ICP and fastICP, two approaches which employ an error inequality to address the identifiability problem of ICP. The inequality states that the minimum prediction error of the predictor using causal parents is the smallest among all predictors which do not use descendants. fastICP is an efficient approximation tailored for large problems as it exploits the inequality and a heuristic to run fewer tests. MMSE-ICP and fastICP not only outperform competitive baselines in many simulations but also achieve state-of-the-art result on a large scale real data benchmark.
(Un)certainty of (Un)fairness: Preference-Based Selection of Certainly Fair Decision-Makers
Duong, Manh Khoi, Conrad, Stefan
Fairness metrics are used to assess discrimination and disparity of the chances between yellow and blue candidates of getting bias in decision-making processes across various domains, including accepted. Intuitively, we are more certain about the decisions machine learning models and human decision-makers in real-world being made by company A than company B. In the case of company applications. This involves calculating the disparities between probabilistic B, the rejection of blue candidates can be attributed to random outcomes among social groups, such as acceptance rates circumstances. In this case, we would judge company A as more discriminatory between male and female applicants. However, traditional fairness than company B because we are more certain that A metrics do not account for the uncertainty in these processes and is unfair and very uncertain about the unfairness of B. But if both lack of comparability when two decision-makers exhibit the same companies accepted all applicants, the disparity would be 0%, and disparity. Using Bayesian statistics, we quantify the uncertainty of we would conversely judge B as more discriminatory than A. This is the disparity to enhance discrimination assessments. We represent because we are certain that A is fair, while we are uncertain about the each decision-maker, whether a machine learning model or a human, fairness of B. Lastly, when comparing between uncertain fair and uncertain by its disparity and the corresponding uncertainty in that disparity.
Unrolled denoising networks provably learn optimal Bayesian inference
Karan, Aayush, Shah, Kulin, Chen, Sitan, Eldar, Yonina C.
Much of Bayesian inference centers around the design of estimators for inverse problems which are optimal assuming the data comes from a known prior. But what do these optimality guarantees mean if the prior is unknown? In recent years, algorithm unrolling has emerged as deep learning's answer to this age-old question: design a neural network whose layers can in principle simulate iterations of inference algorithms and train on data generated by the unknown prior. Despite its empirical success, however, it has remained unclear whether this method can provably recover the performance of its optimal, prior-aware counterparts. In this work, we prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP). For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network approximately converge to the same denoisers used in Bayes AMP. We also provide extensive numerical experiments for compressed sensing and rank-one matrix estimation demonstrating the advantages of our unrolled architecture - in addition to being able to obliviously adapt to general priors, it exhibits improvements over Bayes AMP in more general settings of low dimensions, non-Gaussian designs, and non-product priors.
Optimizing VarLiNGAM for Scalable and Efficient Time Series Causal Discovery
Jiao, Ziyang, Guo, Ce, Luk, Wayne
Causal discovery identifies causal relationships in data, but the task is more complex for multivariate time series due to the computational demands of methods like VarLiNGAM, which combines a Vector Autoregressive Model with a Linear Non-Gaussian Acyclic Model. This study optimizes causal discovery specifically for time series data, which are common in practical applications. Time series causal discovery is particularly challenging because of temporal dependencies and potential time lag effects. By developing a specialized dataset generator and reducing the computational complexity of the VarLiNGAM model from \( O(m^3 \cdot n) \) to \( O(m^3 + m^2 \cdot n) \), this study enhances the feasibility of processing large datasets. The proposed methods were validated on advanced computational platforms and tested on simulated, real-world, and large-scale datasets, demonstrating improved efficiency and performance. The optimized algorithm achieved 7 to 13 times speedup compared to the original and about 4.5 times speedup compared to the GPU-accelerated version on large-scale datasets with feature sizes from 200 to 400. Our methods extend current causal discovery capabilities, making them more robust, scalable, and applicable to real-world scenarios, facilitating advancements in fields like healthcare and finance.
The Central Role of the Loss Function in Reinforcement Learning
Wang, Kaiwen, Kallus, Nathan, Sun, Wen
This paper illustrates the central role of loss functions in data-driven decision making, providing a comprehensive survey on their influence in cost-sensitive classification (CSC) and reinforcement learning (RL). We demonstrate how different regression loss functions affect the sample efficiency and adaptivity of value-based decision making algorithms. Across multiple settings, we prove that algorithms using the binary cross-entropy loss achieve first-order bounds scaling with the optimal policy's cost and are much more efficient than the commonly used squared loss. Moreover, we prove that distributional algorithms using the maximum likelihood loss achieve second-order bounds scaling with the policy variance and are even sharper than first-order bounds. This in particular proves the benefits of distributional RL. We hope that this paper serves as a guide analyzing decision making algorithms with varying loss functions, and can inspire the reader to seek out better loss functions to improve any decision making algorithm.