Bayesian Learning
Sanity Checking Causal Representation Learning on a Simple Real-World System
Gamella, Juan L., Bing, Simon, Runge, Jakob
We evaluate methods for causal representation learning (CRL) on a simple, real-world system where these methods are expected to work. The system consists of a controlled optical experiment specifically built for this purpose, which satisfies the core assumptions of CRL and where the underlying causal factors (the inputs to the experiment) are known, providing a ground truth. We select methods representative of different approaches to CRL and find that they all fail to recover the underlying causal factors. To understand the failure modes of the evaluated algorithms, we perform an ablation on the data by substituting the real data-generating process with a simpler synthetic equivalent. The results reveal a reproducibility problem, as most methods already fail on this synthetic ablation despite its simple data-generating process. Additionally, we observe that common assumptions on the mixing function are crucial for the performance of some of the methods but do not hold in the real data. Our efforts highlight the contrast between the theoretical promise of the state of the art and the challenges in its application. We hope the benchmark serves as a simple, real-world sanity check to further develop and validate methodology, bridging the gap towards CRL methods that work in practice. We make all code and datasets publicly available at github.com/simonbing/CRLSanityCheck
Text classification using machine learning methods
In this paper we present the results of an experiment aimed to use machine learning methods to obtain models that can be used for the automatic classification of products. In order to apply automatic classification methods, we transformed the product names from a text representation to numeric vectors, a process called word embedding. We used several embedding methods: Count Vectorization, TF-IDF, Word2Vec, FASTTEXT, and GloVe. Having the product names in a form of numeric vectors, we proceeded with a set of machine learning methods for automatic classification: Logistic Regression, Multinomial Naive Bayes, kNN, Artificial Neural Networks, Support Vector Machines, and Decision trees with several variants. The results show an impressive accuracy of the classification process for Support Vector Machines, Logistic Regression, and Random Forests. Regarding the word embedding methods, the best results were obtained with the FASTTEXT technique.
A Fokker-Planck-Based Loss Function that Bridges Dynamics with Density Estimation
Lu, Zhixin, Kuśmierz, Łukasz, Mihalas, Stefan
We have derived a novel loss function from the Fokker-Planck equation that links dynamical system models with their probability density functions, demonstrating its utility in model identification and density estimation. In the first application, we show that this loss function can enable the extraction of dynamical parameters from non-temporal datasets, including timestamp-free measurements from steady non-equilibrium systems such as noisy Lorenz systems and gene regulatory networks. In the second application, when coupled with a density estimator, this loss facilitates density estimation when the dynamic equations are known. For density estimation, we propose a density estimator that integrates a Gaussian Mixture Model with a normalizing flow model. It simultaneously estimates normalized density, energy, and score functions from both empirical data and dynamics. It is compatible with a variety of data-based training methodologies, including maximum likelihood and score matching. It features a latent space akin to a modern Hopfield network, where the inherent Hopfield energy effectively assigns low densities to sparsely populated data regions, addressing common challenges in neural density estimators. Additionally, this Hopfield-like energy enables direct and rapid data manipulation through the Concave-Convex Procedure (CCCP) rule, facilitating tasks such as denoising and clustering. Our work demonstrates a principled framework for leveraging the complex interdependencies between dynamics and density estimation, as illustrated through synthetic examples that clarify the underlying theoretical intuitions.
Stein's unbiased risk estimate and Hyv\"arinen's score matching
Ghosh, Sulagna, Ignatiadis, Nikolaos, Koehler, Frederic, Lee, Amber
We study two G-modeling strategies for estimating the signal distribution (the empirical Bayesian's prior) from observations corrupted with normal noise. First, we choose the signal distribution by minimizing Stein's unbiased risk estimate (SURE) of the implied Eddington/Tweedie Bayes denoiser, an approach motivated by optimal empirical Bayesian shrinkage estimation of the signals. Second, we select the signal distribution by minimizing Hyv\"arinen's score matching objective for the implied score (derivative of log-marginal density), targeting minimal Fisher divergence between estimated and true marginal densities. While these strategies appear distinct, they are known to be mathematically equivalent. We provide a unified analysis of SURE and score matching under both well-specified signal distribution classes and misspecification. In the classical well-specified setting with homoscedastic noise and compactly supported signal distribution, we establish nearly parametric rates of convergence of the empirical Bayes regret and the Fisher divergence. In a commonly studied misspecified model, we establish fast rates of convergence to the oracle denoiser and corresponding oracle inequalities. Our empirical results demonstrate competitiveness with nonparametric maximum likelihood in well-specified settings, while showing superior performance under misspecification, particularly in settings involving heteroscedasticity and side information.
Bayesian Computation in Deep Learning
Chen, Wenlong, Li, Bolian, Zhang, Ruqi, Li, Yingzhen
Bayesian computation has achieved profound success in many modeling tasks with statistics tools such as generalized linear models (Dobson and Barnett, 2018; Nelder and Wedderburn, 1972). Yet these traditional tools fail to produce satisfactory predictions for high-dimensional and highly complex data such as images, speech and videos. Deep Learning (LeCun et al., 2015a) provides an attractive solution. At the time of late 2023, deep neural networks achieve accurate predictions for image classification (Dehghani et al., 2023), segmentation (Kirillov et al., 2023) and speech recognition tasks (Zhang et al., 2023). Meanwhile they have also demonstrated an astonishing capability for generating photo-realistic and/or artistic images (Rombach et al., 2022), music (Agostinelli et al., 2023) and videos (Liang et al., 2022). Nowadays deep neural networks have become a standard modeling tool for many of the applications in AI and related fields, and the success of deep learning so far are based on training deterministic deep neural networks on big data. So one might ask: is there a place for Bayesian computation in modern deep learning?
Sparkle: A Statistical Learning Toolkit for High-Dimensional Hawkes Processes in Python
This paper introduce the Python package Sparklen (see Lacoste (2025)), which implements a complete set of statistical learning methods for exponential Hawkes processes with an emphasize on high-dimension setting. Hawkes processes, introduced in Hawkes (1971), form a specific but rather versatile class of point processes. Such processes model time series in which the occurrence of one event temporarily increases the probability of other events occurring. This intrinsic ability to take into account self-exciting effects makes them particularly interesting for real data modeling. Historically applied in seismology (see Ogata (1988)), they have since been used in a wide variety of other fields, including neuroscience in Reynaud-Bouret, Rivoirard, and Tuleau-Malot (2013), finance in Bacry, Mastromatteo, and Muzy (2015), ecology in Denis, Dion-Blanc, Lacoste, Sansonnet, and Bas (2024). The multidimensional version, known as the Multivariate Hawkes Processes (MHP), captures additionally interactions among each univariate process within a network. This generalization enables the modeling of more intricate dynamics, significantly expanding the range of potential applications. For example, MHP has been applied to model action potentials within neural networks in Bonnet, Dion-Blanc, Gindraud, and Lemler (2022), or for trend detection in social networks in Pinto, Chahed, and Altman (2015).
Truth in Text: A Meta-Analysis of ML-Based Cyber Information Influence Detection Approaches
Cyber information influence, or disinformation in general terms, is widely regarded as one of the biggest threats to social progress and government stability. From US presidential elections to European Union referendums and down to regional news reporting of wildfires, lies and post-truths have normalized radical decision-making. Accordingly, there has been an explosion in research seeking to detect disinformation in online media. The frontier of disinformation detection research is leveraging a variety of ML techniques such as traditional ML algorithms like Support Vector Machines, Random Forest, and Na\"ive Bayes. Other research has applied deep learning models including Convolutional Neural Networks, Long Short-Term Memory networks, and transformer-based architectures. Despite the overall success of such techniques, the literature demonstrates inconsistencies when viewed holistically which limits our understanding of the true effectiveness. Accordingly, this work employed a two-stage meta-analysis to (a) demonstrate an overall meta statistic for ML model effectiveness in detecting disinformation and (b) investigate the same by subgroups of ML model types. The study found the majority of the 81 ML detection techniques sampled have greater than an 80\% accuracy with a Mean sample effectiveness of 79.18\% accuracy. Meanwhile, subgroups demonstrated no statistically significant difference between-approaches but revealed high within-group variance. Based on the results, this work recommends future work in replication and development of detection methods operating at the ML model level.
Guarding the Privacy of Label-Only Access to Neural Network Classifiers via iDP Verification
Kabaha, Anan, Drachsler-Cohen, Dana
Neural networks are susceptible to privacy attacks that can extract private information of the training set. To cope, several training algorithms guarantee differential privacy (DP) by adding noise to their computation. However, DP requires to add noise considering every possible training set. This leads to a significant decrease in the network's accuracy. Individual DP (iDP) restricts DP to a given training set. We observe that some inputs deterministically satisfy iDP without any noise. By identifying them, we can provide iDP label-only access to the network with a minor decrease to its accuracy. However, identifying the inputs that satisfy iDP without any noise is highly challenging. Our key idea is to compute the iDP deterministic bound (iDP-DB), which overapproximates the set of inputs that do not satisfy iDP, and add noise only to their predicted labels. To compute the tightest iDP-DB, which enables to guard the label-only access with minimal accuracy decrease, we propose LUCID, which leverages several formal verification techniques. First, it encodes the problem as a mixed-integer linear program, defined over a network and over every network trained identically but without a unique data point. Second, it abstracts a set of networks using a hyper-network. Third, it eliminates the overapproximation error via a novel branch-and-bound technique. Fourth, it bounds the differences of matching neurons in the network and the hyper-network and employs linear relaxation if they are small. We show that LUCID can provide classifiers with a perfect individuals' privacy guarantee (0-iDP) -- which is infeasible for DP training algorithms -- with an accuracy decrease of 1.4%. For more relaxed $\varepsilon$-iDP guarantees, LUCID has an accuracy decrease of 1.2%. In contrast, existing DP training algorithms reduce the accuracy by 12.7%.
Partition Tree Weighting for Non-Stationary Stochastic Bandits
Veness, Joel, Hutter, Marcus, Gyorgy, Andras, Grau-Moya, Jordi
In contrast to popular decision-making frameworks such as reinforcement learning, which are built upon appealing to decision-theoretic notions such as Maximum Expected Utility, we instead construct an agent by trying to minimise the expected number of bits needed to losslessly describe general agent-environment interactions. The appeal with this approach is that if we can construct a good universal coding scheme for arbitrary agent interactions, one could simply sample from this coding distribution to generate a control policy. However when considering general agents, whose goal is to work well across multiple environments, this question turns out to be surprisingly subtle. Naive approaches which do not discriminate between actions and observations fail, and are subject to the self-delusion problem [Ortega et al., 2021]. In this work, we will adopt a universal source coding perspective to this question, and showcase its efficacy by applying it to the challenging non-stationary stochastic bandit problem. In the passive case, namely, sequential prediction of observations under the logarithmic loss, there is a well developed universal source coding literature for dealing with non-stationary sources under various types of non-stationarity. The most influential idea for modelling piecewise stationary sources is the transition diagram technique of Willems [1996]. This technique performs Bayesian model averaging over all possible partitions of a sequence of data, cleverly exploiting dynamic programming to yield an algorithm which has both quadratic time complexity and provable regret guarantees.
Mixture models for data with unknown distributions
We describe and analyze a broad class of mixture models for real-valued multivariate data in which the probability density of observations within each component of the model is represented as an arbitrary combination of basis functions. Fits to these models give us a way to cluster data with distributions of unknown form, including strongly non-Gaussian or multimodal distributions, and return both a division of the data and an estimate of the distributions, effectively performing clustering and density estimation within each cluster at the same time. We describe two fitting methods, one using an expectation-maximization (EM) algorithm and the other a Bayesian non-parametric method using a collapsed Gibbs sampler. The former is numerically efficient, but gives only point estimates of the probability densities. The latter is more computationally demanding but returns a full Bayesian posterior and also an estimate of the number of components. We demonstrate our methods with a selection of illustrative applications and give code implementing both algorithms.