Uncertainty
Expressive power of tensor-network factorizations for probabilistic modeling
Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, Ignacio Cirac
Many problems in diverse areas of computer science and physics involve constructing efficient representations of high-dimensional functions. Neural networks are a particular example of such representations that have enjoyed great empirical success, and much effort has been dedicated to understanding their expressive power - i.e. the set of functions that they can efficiently represent. Analogously, tensor networks are a class of powerful representations of high-dimensional arrays (tensors), for which a variety of algorithms and methods have been developed.
Smooth Flow Matching
Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
A PC Algorithm for Max-Linear Bayesian Networks
Améndola, Carlos, Hollering, Benjamin, Nowell, Francesco
Max-linear Bayesian networks (MLBNs) are a relatively recent class of structural equation models which arise when the random variables involved have heavy-tailed distributions. Unlike most directed graphical models, MLBNs are typically not faithful to d-separation, and thus classical causal discovery algorithms such as the PC algorithm or greedy equivalence search cannot be used to accurately recover the true graph structure. In this paper, we begin the study of constraint-based discovery algorithms for MLBNs given an oracle for testing conditional independence in the true, unknown graph. We show that if the oracle is given by the $\ast$-separation criterion in the true graph, then the PC algorithm remains consistent despite the presence of additional CI statements implied by $\ast$-separation. We also introduce a new causal discovery algorithm named "PCstar" which assumes faithfulness to $C^\ast$-separation and is able to orient additional edges which cannot be oriented with only d- or $\ast$-separation.
Structural Foundations for Leading Digit Laws: Beyond Probabilistic Mixtures
This article presents a modern deterministic framework for the study of leading significant digit distributions in numerical data. Rather than relying on traditional probabilistic or mixture-based explanations, we demonstrate that the observed frequencies of leading digits are determined by the underlying arithmetic, algorithmic, and structural properties of the data-generating process. Our approach centers on a shift-invariant functional equation, whose general solution is given by explicit affine-plus-periodic formulas. This structural formulation explains the diversity of digit distributions encountered in both empirical and mathematical datasets, including cases with pronounced deviations from logarithmic or scale-invariant profiles. We systematically analyze digit distributions in finite and infinite datasets, address deterministic sequences such as prime numbers and recurrence relations, and highlight the emergence of block-structured and fractal features. The article provides a critical examination of probabilistic models, gives explicit examples and counterexamples, and discusses limitations and open problems for further research. Overall, this work establishes a unified mathematical foundation for digital phenomena and offers a versatile toolset for modeling and analyzing digit patterns in applied and theoretical contexts.
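The contrast between logarithmic and non-logarithmic leading-digit profiles in deterministic data can be illustrated with a small experiment. The sketch below is not taken from the article: the two sequences (powers of two and the block of integers 1 to 999) and the helper names `leading_digit` and `digit_frequencies` are assumptions chosen for illustration. It compares their leading-digit frequencies with the logarithmic profile log10(1 + 1/d).

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """Return the most significant decimal digit of a positive integer."""
    return int(str(n)[0])


def digit_frequencies(values):
    """Map each digit 1..9 to its relative frequency as a leading digit."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


# Logarithmic profile: P(d) = log10(1 + 1/d).
logarithmic = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Illustrative deterministic sequence 1: powers of two, 2^1 .. 2^1000.
powers_of_two = digit_frequencies(2 ** k for k in range(1, 1001))

# Illustrative deterministic sequence 2: the finite block of integers 1 .. 999.
block = digit_frequencies(range(1, 1000))

# Columns: digit, logarithmic profile, powers of two, integer block.
for d in range(1, 10):
    print(d, round(logarithmic[d], 4), round(powers_of_two[d], 4), round(block[d], 4))
```

In this sketch the powers of two track the logarithmic profile closely, while the integer block gives every digit the same frequency, a simple case of a finite dataset whose structure, rather than any probabilistic mixture, fixes the digit distribution.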
Order Optimal Regret Bounds for Sharpe Ratio Optimization in the Bandit Setting
Shah, Mohammad Taha, Khurshid, Sabrina, Ghatak, Gourab
In this paper, we investigate the problem of sequential decision-making for Sharpe ratio (SR) maximization in a stochastic bandit setting. We focus on the Thompson Sampling (TS) algorithm, a Bayesian approach celebrated for its empirical performance and exploration efficiency, under the assumption of Gaussian rewards with unknown parameters. Unlike conventional bandit objectives, which focus on maximizing cumulative reward, Sharpe ratio optimization introduces an inherent tradeoff between achieving high returns and controlling risk, demanding careful exploration of both mean and variance. Our theoretical contributions include a novel regret decomposition specifically designed for the Sharpe ratio, highlighting the role of information acquisition about the reward distribution in driving learning efficiency. We then establish fundamental performance limits for the proposed algorithm, SRTS, in terms of an upper bound on regret. We also derive a matching lower bound and thereby show order-optimality. Our results show that Thompson Sampling achieves logarithmic regret over time, with distribution-dependent factors capturing the difficulty of distinguishing arms based on risk-adjusted performance. Empirical simulations show that our algorithm significantly outperforms existing algorithms.
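The tradeoff between return and risk described above can be made concrete with a toy calculation. The sketch below is not the SRTS algorithm and involves no Thompson Sampling. It assumes the common definition of the Sharpe ratio as mean reward divided by standard deviation, with a zero risk-free rate (the abstract does not state the exact definition used), and the two Gaussian arms and their parameters are hypothetical.

```python
import math
import random


def sharpe_ratio(rewards):
    """Sample mean divided by sample standard deviation (risk-free rate assumed 0)."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / (n - 1)
    return mean / math.sqrt(variance)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical Gaussian arms given as (mean, standard deviation).
arms = {"high_return": (1.0, 2.0), "low_risk": (0.6, 0.5)}

# Draw rewards from each arm and estimate its Sharpe ratio from the samples.
samples = {
    name: [rng.gauss(mu, sigma) for _ in range(20000)]
    for name, (mu, sigma) in arms.items()
}
estimates = {name: sharpe_ratio(xs) for name, xs in samples.items()}

best_by_mean = max(arms, key=lambda name: arms[name][0])
best_by_sharpe = max(estimates, key=estimates.get)
print("best by mean reward:", best_by_mean)
print("best by Sharpe ratio:", best_by_sharpe)
```

Under these assumed parameters the arm with the higher mean is not the arm with the higher Sharpe ratio, which is why a learner targeting the Sharpe ratio has to estimate both the mean and the variance of each arm.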
The DeepLog Neurosymbolic Machine
Derkinderen, Vincent, Manhaeve, Robin, Adriaensen, Rik, Van Praet, Lucas, De Smet, Lennert, Marra, Giuseppe, De Raedt, Luc
We contribute a theoretical and operational framework for neurosymbolic AI called DeepLog. DeepLog introduces building blocks and primitives that abstract away from the representations and computational mechanisms commonly used in neurosymbolic AI. DeepLog can represent and emulate a wide range of neurosymbolic systems. It consists of two key components. The first is the DeepLog language for specifying neurosymbolic models and inference tasks. This language consists of an annotated neural extension of grounded first-order logic, and abstracts away from the type of logic, e.g. boolean, fuzzy or probabilistic, and from whether logic is used in the architecture or in the loss function. The second DeepLog component is situated at the computational level and uses extended algebraic circuits as computational graphs. Together these two components are to be considered as a neurosymbolic abstract machine, with the DeepLog language as the intermediate level of abstraction and the circuits level as the computational one. DeepLog is implemented in software, relies on the latest insights in implementing algebraic circuits on GPUs, and is declarative in that it is easy to obtain different neurosymbolic models by making different choices for the underlying algebraic structures and logics. The generality and efficiency of the DeepLog neurosymbolic machine are demonstrated through an experimental comparison 1) between different fuzzy and probabilistic logics, 2) between using logic in the architecture or in the loss function, and 3) between a standalone CPU-based implementation of a neurosymbolic AI system and a DeepLog GPU-based one.
Prediction of Hospital Associated Infections During Continuous Hospital Stays
Datta, Rituparna, Kamruzzaman, Methun, Klein, Eili Y., Madden, Gregory R, Deng, Xinwei, Vullikanti, Anil, Bhattacharya, Parantapa
In 2019, the US Centers for Disease Control and Prevention (CDC) designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences from it remains especially high for hospitalized patients due to a unique combination of factors, including co-morbid conditions, immunosuppression, antibiotic use, and risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test result outcomes for patients during a single hospitalization. This model can be used to answer many important questions from the perspective of hospital administrators seeking to mitigate the risk of MRSA infections. Our model is based on the probabilistic programming paradigm, and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models using two real-world datasets.
Uncertainty Tube Visualization of Particle Trajectories
Li, Jixian, Ouermi, Timbwaoga Aime Judicael, Han, Mengjiao, Johnson, Chris R.
This figure compares (a) a spaghetti plot of ensemble members, (b) a circular tube, and (c) our uncertainty tube for visualizing model uncertainty. Previous methods face challenges such as visual clutter (a) or the assumption of symmetric uncertainty (a, b), but our uncertainty tube (c), constructed using superellipses, provides a more accurate visualization of asymmetric uncertainty. Its superelliptical shape distinctly improves the visualization of the uncertainty orientation and its evolution along trajectories, as highlighted in the boxes. The visualization is further enhanced with a color palette that uses gray for low uncertainty, blue for large asymmetric uncertainty, and yellow for large symmetric uncertainty.
Predicting particle trajectories with neural networks (NNs) has substantially enhanced many scientific and engineering domains. However, effectively quantifying and visualizing the inherent uncertainty in predictions remains challenging. Without an understanding of the uncertainty, the reliability of NN models in applications where trustworthiness is paramount is significantly compromised. This paper introduces the uncertainty tube, a novel, computationally efficient visualization method designed to represent this uncertainty in NN-derived particle paths. By integrating well-established uncertainty quantification techniques, such as Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG), we demonstrate the practical utility of the uncertainty tube, showcasing its application on both synthetic and simulation datasets.
Understanding and analyzing flow field data is fundamental for numerous scientific and engineering disciplines, including fluid dynamics, atmospheric science, and material processing. Traditional computational fluid dynamics (CFD) simulations are often computationally intensive, a limitation that has led researchers to explore more efficient paradigms.
This exploration has given rise to neural networks (NNs) as a transformative tool in this domain, driven by their capacity to overcome these computational bottlenecks. Notably, recent work, such as Han et al. [26, 27], leverages NNs to learn Lagrangian-based flow maps, enabling efficient and robust particle tracing in time-varying fields. These data-driven models demonstrate remarkable accuracy and speed, making them increasingly indispensable for accelerating discovery and design cycles in fluid dynamics. Despite these advancements, a significant challenge remains in providing a comprehensive understanding of the confidence associated with NN predictions in flow fields.
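The superelliptical cross-section mentioned in the description of the uncertainty tube can be illustrated with the generic superellipse (Lamé curve) |x/a|^n + |y/b|^n = 1. The sketch below is only that generic curve, not the authors' construction: the function names, the semi-axes `a` and `b`, the exponent `n`, and the sample count are all hypothetical, and how such parameters would be derived from ensemble predictions is not covered here.

```python
import math


def superellipse_point(t, a, b, n):
    """Point at parameter t on the curve |x/a|^n + |y/b|^n = 1."""
    c, s = math.cos(t), math.sin(t)
    # Raising |cos t| and |sin t| to the power 2/n keeps the point on the curve,
    # because |x/a|^n + |y/b|^n then equals cos^2(t) + sin^2(t).
    x = a * math.copysign(abs(c) ** (2.0 / n), c)
    y = b * math.copysign(abs(s) ** (2.0 / n), s)
    return x, y


def superellipse_outline(a, b, n, samples=64):
    """Sample a closed outline of the superellipse as a list of (x, y) points."""
    return [
        superellipse_point(2.0 * math.pi * k / samples, a, b, n)
        for k in range(samples)
    ]


# Hypothetical cross-section: wider along x than along y, with a boxy exponent.
outline = superellipse_outline(a=2.0, b=1.0, n=4.0)
print(outline[:3])
```

With `n = 2` the outline reduces to an ordinary ellipse, and larger exponents give a more rectangular shape. Unequal semi-axes are one simple way such a cross-section can express different spread in different directions.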