Regression
Interpretable Neural Causal Models with TRAM-DAGs
The ultimate goal of most scientific studies is to understand the underlying causal mechanism between the involved variables. Structural causal models (SCMs) are widely used to represent such causal mechanisms. Given an SCM, causal queries on all three levels of Pearl's causal hierarchy can be answered: $L_1$ observational, $L_2$ interventional, and $L_3$ counterfactual. An essential aspect of modeling the SCM is to model the dependency of each variable on its causal parents. Traditionally this is done by parametric statistical models, such as linear or logistic regression models. This approach handles all kinds of data types and yields interpretable models but bears the risk of introducing bias. More recently, neural causal models have emerged that use neural networks (NNs) to model the causal relationships, allowing the estimation of nearly any underlying functional form without bias. However, current neural causal models are generally restricted to continuous variables and do not yield an interpretable form of the causal relationships. Transformation models (TRAMs) range from simple statistical regressions to complex networks and can handle continuous, ordinal, and binary data. Here, we propose to use TRAMs to model the functional relationships in SCMs, allowing us to bridge the gap between interpretability and flexibility in causal modeling. We call this method TRAM-DAG and currently assume that the underlying directed acyclic graph is known. For the fully observed case, we benchmark TRAM-DAGs against state-of-the-art statistical and NN-based causal models. We show that TRAM-DAGs are interpretable but also achieve equal or superior performance in queries ranging from $L_1$ to $L_3$ in the causal hierarchy. For the continuous case, TRAM-DAGs allow for counterfactual queries for three common causal structures, including unobserved confounding.
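To make the three levels of the causal hierarchy concrete, the following is a minimal sketch on a hypothetical two-variable linear SCM. It is not the TRAM-DAG method: the mechanism `f2`, its coefficient of 2, and the Gaussian noise are illustrative assumptions chosen so that each query can be written in a few lines.

```python
import random

# Hypothetical SCM (illustrative only): X1 := U1, X2 := 2 * X1 + U2
def f2(x1, u2):
    """Structural equation for X2 given its parent X1 and its noise U2."""
    return 2.0 * x1 + u2

def sample_observational(rng):
    """L1: draw (X1, X2) from the unmodified SCM."""
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = u1
    return x1, f2(x1, u2)

def sample_interventional(rng, x1_do):
    """L2: do(X1 = x1_do) replaces the mechanism of X1 and keeps that of X2."""
    u2 = rng.gauss(0.0, 1.0)
    return x1_do, f2(x1_do, u2)

def counterfactual_x2(x1_obs, x2_obs, x1_cf):
    """L3: abduction (recover U2), action (set X1), prediction (recompute X2)."""
    u2 = x2_obs - 2.0 * x1_obs  # abduction works because f2 is invertible in u2
    return f2(x1_cf, u2)

rng = random.Random(0)
x1, x2 = sample_observational(rng)
x1_int, x2_int = sample_interventional(rng, x1_do=3.0)
x2_cf = counterfactual_x2(x1, x2, x1_cf=x1 + 1.0)
```

In this toy model, raising X1 by one unit shifts the counterfactual X2 by exactly the assumed coefficient. The counterfactual step relies on the structural equation being invertible in its noise term.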
Investigating Cultural Dimensions and Technological Acceptance: The Adoption of Electronic Performance and Tracking Systems in Qatar's Football Sector
Qatar's football sector has undergone a substantial technological transformation with the implementation of Electronic Performance and Tracking Systems (EPTS). This study examines the impact of cultural and technological factors on EPTS adoption, using Hofstede's Cultural Dimensions Theory and the Technology Acceptance Model (TAM) as theoretical frameworks. An initial exploratory study involved ten participants, followed by an expanded dataset comprising thirty stakeholders, including players, coaches, and staff from Qatari football organizations. Multiple regression analysis was conducted to evaluate the relationships between perceived usefulness, perceived ease of use, power distance, innovation receptiveness, integration complexity, and overall adoption. The results indicate that perceived usefulness, innovation receptiveness, and lower power distance significantly drive EPTS adoption, while ease of use is marginally significant and integration complexity is non-significant in this sample. These findings provide practical insights for sports technology stakeholders in Qatar and emphasize the importance of aligning cultural considerations with technological readiness for successful EPTS integration.
Nonlinear Bayesian Update via Ensemble Kernel Regression with Clustering and Subsampling
A nonlinear Bayesian update for a prior ensemble is proposed to extend traditional ensemble Kalman filtering to settings characterized by non-Gaussian priors and nonlinear measurement operators. In this framework, the observed component is first denoised via a standard Kalman update, while the unobserved component is estimated using a nonlinear regression approach based on kernel density estimation. The method incorporates a subsampling strategy to ensure stability and, when necessary, employs unsupervised clustering to refine the conditional estimate. Numerical experiments on Lorenz systems and a PDE-constrained inverse problem illustrate that the proposed nonlinear update can reduce estimation errors compared to standard linear updates, especially in highly nonlinear scenarios.
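The two-stage idea can be sketched on a scalar toy problem. The code below is one plausible reading of the abstract, not the authors' algorithm: it assumes a perturbed-observation Kalman update for the observed component and a Nadaraya-Watson (Gaussian kernel) regression for the unobserved one. The subsampling and clustering steps are omitted, and the toy relation y = x^2, the bandwidth, and the function names are hypothetical.

```python
import math
import random

def kalman_update_observed(xs, obs, obs_var, rng):
    """Stochastic (perturbed-observation) Kalman update of a scalar observed component."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    gain = var / (var + obs_var)
    noise_sd = math.sqrt(obs_var)
    return [x + gain * (obs + rng.gauss(0.0, noise_sd) - x) for x in xs]

def kernel_regression(xs, ys, x_query, bandwidth):
    """Nadaraya-Watson estimate of E[y | x = x_query] with a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:  # query far from every member: fall back to the prior mean
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total

rng = random.Random(0)
# Prior ensemble: x is observed, y depends nonlinearly on x (assumed toy relation).
xs_prior = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys_prior = [x ** 2 + rng.gauss(0.0, 0.1) for x in xs_prior]

# Step 1: linear update of the observed component.
xs_post = kalman_update_observed(xs_prior, obs=1.0, obs_var=0.1, rng=rng)
# Step 2: nonlinear conditional estimate of the unobserved component.
ys_post = [kernel_regression(xs_prior, ys_prior, x, bandwidth=0.3) for x in xs_post]
```

A purely linear update would regress y on x with a single slope, which cannot represent the quadratic dependence in this toy case. The kernel estimate follows the prior ensemble locally instead.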
Binary AddiVortes: (Bayesian) Additive Voronoi Tessellations for Binary Classification with an application to Predicting Home Mortgage Application Outcomes
Stone, Adam J., Ogundimu, Emmanuel, Gosling, John Paul
The Additive Voronoi Tessellations (AddiVortes) model is a multivariate regression model that uses multiple Voronoi tessellations to partition the covariate space for an additive ensemble model. In this paper, the AddiVortes framework is extended to binary classification by incorporating a probit model with a latent variable formulation. Specifically, we utilise a data augmentation technique, where a latent variable is introduced and the binary response is determined via thresholding. In most cases, the AddiVortes model outperforms random forests, BART and other leading black-box regression models when compared using a range of metrics. A comprehensive analysis is conducted using AddiVortes to predict an individual's likelihood of being approved for a home mortgage, based on a range of covariates. This evaluation highlights the model's effectiveness in capturing complex relationships within the data and its potential for improving decision-making in mortgage approval processes.
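The latent-variable step of a probit model can be illustrated in isolation. The sketch below assumes the usual augmentation in which a latent z ~ N(mu, 1) is drawn and the binary response is 1 exactly when z > 0. Here `mu` stands in for whatever the additive ensemble predicts, and the tessellation updates themselves are not shown. The function name and the example values are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def sample_latent(y, mu, rng):
    """Draw z ~ N(mu, 1) truncated to z > 0 if y == 1, else to z <= 0 (inverse-CDF method)."""
    p_zero = STD_NORMAL.cdf(-mu)  # P(z <= 0) under N(mu, 1)
    u = rng.random()
    if y == 1:
        p = p_zero + u * (1.0 - p_zero)  # uniform on (p_zero, 1)
    else:
        p = u * p_zero                   # uniform on (0, p_zero)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return mu + STD_NORMAL.inv_cdf(p)

rng = random.Random(1)
labels = [1, 0, 1, 1, 0]
means = [0.3, -0.8, 1.5, -0.2, 0.1]  # placeholder fitted values
latents = [sample_latent(y, m, rng) for y, m in zip(labels, means)]
```

Conditional on the sampled latents, the model for the mean reduces to a regression with unit-variance Gaussian noise. This is what makes the augmentation convenient inside a Gibbs-style sampler.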
Doubly robust identification of treatment effects from multiple environments
De Bartolomeis, Piersilvio, Kostin, Julia, Abad, Javier, Wang, Yixin, Yang, Fanny
Treatment effects are key quantities of interest in applied domains such as medicine and social sciences, as they determine the impact of interventions like novel treatments or policies on outcomes of interest. To achieve this goal, researchers often rely on randomized trials since randomizing the treatment assignment guarantees unbiased treatment effect estimates under mild assumptions. However, methods relying on randomized data face several issues, such as small sample sizes, sample populations that do not reflect those seen in the real world, and ethical or financial constraints. As a result, there is growing interest in using observational data to estimate treatment effects. A fundamental challenge in using observational data is the selection of a valid adjustment set, i.e. a set of covariates that can be used to identify and estimate the treatment effect. Although criteria for identifying valid adjustment sets are well-established, they rely on the knowledge of the underlying causal graph. When the graph is not known, practitioners often adjust for all available covariates [5]. Yet, this approach runs the risk of including bad controls--covariates that open backdoor paths between the treatment (T) and the outcome (Y), thereby introducing bias into the treatment effect estimate.
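A small simulation can show how adjusting for a bad control introduces bias. The sketch below assumes a hypothetical "M-structure" in which Z is a collider between two unobserved variables U1 (a cause of T) and U2 (a cause of Y). The true effect of T on Y is set to 1, and all coefficients and noise scales are illustrative rather than taken from the paper.

```python
import random

def ols_slope(t, y):
    """Slope of y on t (with intercept)."""
    n = len(t)
    mt, my = sum(t) / n, sum(y) / n
    return sum((a - mt) * (b - my) for a, b in zip(t, y)) / sum((a - mt) ** 2 for a in t)

def ols_slope_adjusted(t, z, y):
    """Coefficient on t when regressing y on t and z (with intercept)."""
    n = len(t)
    mt, mz, my = sum(t) / n, sum(z) / n, sum(y) / n
    tc = [a - mt for a in t]
    zc = [a - mz for a in z]
    yc = [a - my for a in y]
    stt = sum(a * a for a in tc)
    szz = sum(a * a for a in zc)
    stz = sum(a * b for a, b in zip(tc, zc))
    sty = sum(a * b for a, b in zip(tc, yc))
    szy = sum(a * b for a, b in zip(zc, yc))
    # Solve the 2x2 normal equations for the coefficient on t.
    return (szz * sty - stz * szy) / (stt * szz - stz ** 2)

rng = random.Random(0)
t, z, y = [], [], []
for _ in range(20000):
    u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)  # unobserved
    ti = u1 + rng.gauss(0, 1)
    zi = u1 + u2 + rng.gauss(0, 1)             # collider: a "bad control"
    yi = 1.0 * ti + u2 + rng.gauss(0, 1)       # true effect of T on Y is 1
    t.append(ti); z.append(zi); y.append(yi)

effect_unadjusted = ols_slope(t, y)
effect_adjusted = ols_slope_adjusted(t, z, y)
```

Under these assumptions the unadjusted slope is close to the true effect of 1, while conditioning on Z pulls the estimate toward 0.8. Including every available covariate is therefore not automatically safe.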
Using 3D reconstruction from image motion to predict total leaf area in dwarf tomato plants
Usenko, Dmitrii, Helman, David, Giladi, Chen
Accurate estimation of total leaf area (TLA) is essential for assessing plant growth, photosynthetic activity, and transpiration but remains a challenge for bushy plants like dwarf tomatoes. Traditional destructive methods and imaging-based techniques often fall short due to labor intensity, plant damage, or the inability to capture complex canopies. This study evaluated a non-destructive method combining sequential 3D reconstructions from RGB images and machine learning to estimate TLA for three dwarf tomato cultivars--Mohamed, Hahms Gelbe Topftomate, and Red Robin--grown under controlled greenhouse conditions. Two experiments, conducted in spring-summer and autumn-winter, included 73 plants, yielding 418 TLA measurements using an "onion" approach, where layers of leaves were sequentially removed and scanned. High-resolution videos were recorded from multiple angles for each plant, and 500 frames were extracted per plant for 3D reconstruction. Point clouds were created and processed, four reconstruction algorithms (Alpha Shape, Marching Cubes, Poisson's, and Ball Pivoting) were tested, and meshes were evaluated using seven regression models: Multivariable Linear Regression (MLR), Lasso Regression (Lasso), Ridge Regression (Ridge-Reg), Elastic Net Regression (ENR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Multilayer Perceptron (MLP). The Alpha Shape reconstruction (α = 3) combined with XGBoost yielded the best performance, achieving an R² of 0.80 and MAE of 489 cm². These findings demonstrate the robustness of our approach across variable environmental conditions and canopy structures. This scalable, automated TLA estimation method is particularly suited for urban farming and precision agriculture, offering practical implications for automated pruning, improved resource efficiency, and sustainable food production.
Keywords: Total leaf area, dwarf tomato, point cloud, mesh reconstruction, machine learning, precision agriculture
1. Introduction
Total leaf area (TLA) is a comprehensive metric describing the plant's growth and functioning. It is a primary metric that describes the plant's photosynthetic activity and transpiration capacity. Normalized by the plant's surface area, TLA may provide information on the canopy structure, which is crucial for understanding the plant's energy and resource efficiency. For example, reduced TLA is a sign of stress (Dong et al., 2019), while excessive biomass, indicated by a higher TLA, signifies lower water use efficiency (Glenn et al., 2006). Farmers often use pruning to reduce TLA in commercial crops to increase crop productivity (Budiarto et al., 2023). However, measuring and finding the optimum TLA of the crop are challenging tasks.
H-AddiVortes: Heteroscedastic (Bayesian) Additive Voronoi Tessellations
Stone, Adam J., Gosling, John Paul
This paper introduces the Heteroscedastic AddiVortes model, a Bayesian non-parametric regression framework that simultaneously models the conditional mean and variance of a response variable using adaptive Voronoi tessellations. By employing a sum-of-tessellations approach for the mean and a product-of-tessellations approach for the variance, the model provides a flexible and interpretable means to capture complex, predictor-dependent relationships and heteroscedastic patterns in data. This dual-layer representation enables precise inference, even in high-dimensional settings, while maintaining computational feasibility through efficient Markov Chain Monte Carlo (MCMC) sampling and conjugate prior structures. We illustrate the model's capability through both simulated and real-world datasets, demonstrating its ability to capture nuanced variance structures, provide reliable predictive uncertainty quantification, and highlight key predictors influencing both the mean response and its variability. Empirical results show that the Heteroscedastic AddiVortes model offers a substantial improvement in capturing distributional properties compared to both homoscedastic and heteroscedastic alternatives, making it a robust tool for complex regression problems in various applied settings.
March Madness Tournament Predictions Model: A Mathematical Modeling Approach
McIver, Christian, Avalos, Karla, Nayak, Nikhil
This paper proposes a model to predict the outcome of the March Madness tournament based on historical NCAA basketball data since 2013. The framework of this project is a simplification of the FiveThirtyEight NCAA March Madness prediction model, where the only four predictors of interest are Adjusted Offensive Efficiency (ADJOE), Adjusted Defensive Efficiency (ADJDE), Power Rating, and Two-Point Shooting Percentage Allowed. A logistic regression was used with these metrics to generate the probability of a particular team winning each game. Then, a tournament simulation was developed and compared to real-world March Madness brackets to determine the accuracy of the model. Accuracy was calculated using a naive approach and a Spearman rank correlation coefficient.
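The two ingredients named above, a logistic win probability and a Spearman rank correlation, can be sketched as follows. The abstract does not say how the four predictors enter the model, so feeding the difference between the two teams' feature vectors into the logistic function is an assumption. The coefficients and team values are hypothetical, and the Spearman formula shown applies only to rankings without ties.

```python
import math

def win_probability(team_a, team_b, coefs, intercept=0.0):
    """Logistic model on the difference of two teams' feature vectors (assumed form)."""
    score = intercept + sum(c * (a - b) for c, a, b in zip(coefs, team_a, team_b))
    return 1.0 / (1.0 + math.exp(-score))

def spearman_rho(rank_pred, rank_true):
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_pred)
    d2 = sum((p - t) ** 2 for p, t in zip(rank_pred, rank_true))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Hypothetical coefficients for (ADJOE, ADJDE, power rating, 2P% allowed).
coefs = [0.10, -0.10, 2.0, -0.05]
team_a = [118.0, 92.0, 0.95, 45.0]
team_b = [110.0, 98.0, 0.80, 49.0]

p_a_wins = win_probability(team_a, team_b, coefs)
rho_perfect = spearman_rho([1, 2, 3, 4], [1, 2, 3, 4])
rho_reversed = spearman_rho([1, 2, 3, 4], [4, 3, 2, 1])
```

With a zero intercept, the difference formulation makes the two teams' win probabilities sum to one. A tournament simulation needs this property when it draws a winner for each game.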
Steinhaus Filtration and Stable Paths in the Mapper
Arendt, Dustin L., Broussard, Matthew, Krishnamoorthy, Bala, Saul, Nathaniel, Thrall, Amber
We define a new filtration called the Steinhaus filtration built from a single cover based on a generalized Steinhaus distance, a generalization of Jaccard distance. The homology persistence module of a Steinhaus filtration with infinitely many cover elements may not be $q$-tame, even when the covers are in a totally bounded space. While this may pose a challenge to derive stability results, we show that the Steinhaus filtration is stable when the cover is finite. We show that while the \v{C}ech and Steinhaus filtrations are not isomorphic in general, they are isomorphic for a finite point set in dimension one. Furthermore, the VR filtration completely determines the $1$-skeleton of the Steinhaus filtration in arbitrary dimension. We then develop a language and theory for stable paths within the Steinhaus filtration. We demonstrate how the framework can be applied to several applications where a standard metric may not be defined but a cover is readily available. We introduce a new perspective for modeling recommendation system datasets. As an example, we look at a movies dataset and we find the stable paths identified in our framework represent a sequence of movies constituting a gentle transition and ordering from one genre to another. For explainable machine learning, we apply the Mapper algorithm for model induction by building a filtration from a single Mapper complex, and provide explanations in the form of stable paths between subpopulations. For illustration, we build a Mapper complex from a supervised machine learning model trained on the FashionMNIST dataset. Stable paths in the Steinhaus filtration provide improved explanations of relationships between subpopulations of images.
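For finite cover elements, the distance underlying the filtration can be sketched in a few lines. The code assumes the common definition d = 1 - mu(intersection) / mu(union), which reduces to the Jaccard distance under the counting measure. Its extension from pairs to several sets, and the convention of returning 0 for an empty union, are assumptions made here for illustration. The helper name and the optional `weight` mapping are hypothetical.

```python
def steinhaus_distance(sets, weight=None):
    """1 - mu(intersection) / mu(union) for a collection of finite sets.

    With weight=None, mu is the counting measure, so two sets give the Jaccard distance.
    """
    sets = [set(s) for s in sets]
    union = set().union(*sets)
    inter = set.intersection(*sets)
    if weight is None:
        mu = len
    else:
        mu = lambda s: sum(weight[x] for x in s)
    mu_union = mu(union)
    if mu_union == 0:
        return 0.0  # convention chosen here for empty sets
    return 1.0 - mu(inter) / mu_union

d_pair = steinhaus_distance([{1, 2, 3}, {2, 3, 4}])
d_same = steinhaus_distance([{1, 2}, {1, 2}])
d_disjoint = steinhaus_distance([{1}, {2}])
d_triple = steinhaus_distance([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}])
```

A filtration could then include a simplex on a group of cover elements once the threshold reaches their distance. In this way a single cover, rather than a metric on the data, drives the construction.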
Data-Driven Approximation of Binary-State Network Reliability Function: Algorithm Selection and Reliability Thresholds for Large-Scale Systems
While exact reliability computation for binary-state networks is NP-hard/#P-hard, existing approximation methods face critical tradeoffs between accuracy, scalability, and data efficiency. This study evaluates 20 machine learning methods across three reliability regimes--full range (0.0-1.0), high reliability (0.9-1.0), and ultra-high reliability (0.99-1.0)--to address these gaps. We demonstrate that large-scale networks with arc reliability ≥ 0.9 exhibit near-unity system reliability, enabling computational simplifications. Further, we establish a dataset-scale-driven paradigm for algorithm selection: Artificial Neural Networks (ANN) excel with limited data (size < m), while Polynomial Regression (PR) achieves superior accuracy in data-rich environments (size ≥ m). Our findings reveal ANN's Test-MSE of 7.24E-05 at 30,000 samples and PR's optimal performance (5.61E-05) at 40,000 samples, outperforming traditional Monte Carlo simulations. These insights provide actionable guidelines for balancing accuracy, interpretability, and computational efficiency in reliability engineering, with implications for infrastructure resilience and system optimization.
Keywords: Binary-State Networks; Network Reliability Approximated Function; Reliability Thresholds; Dataset Scalability; Artificial Neural Networks (ANN); Polynomial Regression; Monte Carlo Simulation (MCS); Binary-Addition-Tree Algorithm (BAT); BAT-MCS
1. INTRODUCTION
Modern infrastructure systems--from power grids and communication networks to IoT ecosystems--demand rigorous reliability analysis to ensure operational resilience. These systems are often modeled as binary-state networks, where components (arcs/nodes) operate in either functional (1) or failed (0) states [1, 2, 3]. Within this paradigm, network reliability--the probability of maintaining connectivity between specified nodes under given conditions--serves as a critical performance metric [4, 5-7].
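The definition of two-terminal reliability can be made concrete on a small example. The sketch below uses a hypothetical five-arc bridge network (not one from the paper) with independent, undirected arcs that share the same reliability p. It compares exact enumeration over all 2^5 arc states with a plain Monte Carlo estimate.

```python
import itertools
import random

# Hypothetical bridge network: source 0, target 3, undirected arcs.
ARCS = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]

def connected(states, source=0, target=3):
    """True if the target is reachable from the source over working arcs."""
    adj = {}
    for up, (u, v) in zip(states, ARCS):
        if up:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def exact_reliability(p):
    """Sum the probabilities of all arc-state vectors that keep source and target connected."""
    total = 0.0
    for states in itertools.product([0, 1], repeat=len(ARCS)):
        if connected(states):
            k = sum(states)
            total += p ** k * (1 - p) ** (len(ARCS) - k)
    return total

def mc_reliability(p, n_samples, rng):
    """Monte Carlo estimate: fraction of sampled state vectors that stay connected."""
    hits = sum(connected([rng.random() < p for _ in ARCS]) for _ in range(n_samples))
    return hits / n_samples

exact = exact_reliability(0.9)
estimate = mc_reliability(0.9, 20000, random.Random(0))
```

Enumeration doubles in cost with every added arc, which is why sampling or learned approximations become attractive for large networks. The small gap between system reliability and 1 at p = 0.9 also illustrates why estimates in the high-reliability regime need many samples.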