Regression
Average Causal Effect Estimation in DAGs with Hidden Variables: Extensions of Back-Door and Front-Door Criteria
The identification theory for causal effects in directed acyclic graphs (DAGs) with hidden variables is well-developed, but methods for estimating and inferring functionals beyond the g-formula remain limited. Previous studies have proposed semiparametric estimators for identifiable functionals in a broad class of DAGs with hidden variables. While demonstrating double robustness in some models, existing estimators face challenges, particularly with density estimation and numerical integration for continuous variables, and their estimates may fall outside the parameter space of the target estimand. Their asymptotic properties are also underexplored, especially when using flexible statistical and machine learning models for nuisance estimation. This study addresses these challenges by introducing novel one-step corrected plug-in and targeted minimum loss-based estimators of causal effects for a class of DAGs that extend classical back-door and front-door criteria (known as the treatment primal fixability criterion in prior literature). These estimators leverage machine learning to minimize modeling assumptions while ensuring key statistical properties such as asymptotic linearity, double robustness, efficiency, and staying within the bounds of the target parameter space. We establish conditions for nuisance functional estimates in terms of L2(P)-norms to achieve root-n consistent causal effect estimates. To facilitate practical application, we have developed the flexCausal package in R.
Introduction to Machine Learning
This book introduces the mathematical foundations and techniques that lead to the development and analysis of many of the algorithms that are used in machine learning. It starts with an introductory chapter that describes notation used throughout the book and serve at a reminder of basic concepts in calculus, linear algebra and probability and also introduces some measure theoretic terminology, which can be used as a reading guide for the sections that use these tools. The introductory chapters also provide background material on matrix analysis and optimization. The latter chapter provides theoretical support to many algorithms that are used in the book, including stochastic gradient descent, proximal methods, etc. After discussing basic concepts for statistical prediction, the book includes an introduction to reproducing kernel theory and Hilbert space techniques, which are used in many places, before addressing the description of various algorithms for supervised statistical learning, including linear methods, support vector machines, decision trees, boosting, or neural networks. The subject then switches to generative methods, starting with a chapter that presents sampling methods and an introduction to the theory of Markov chains. The following chapter describe the theory of graphical models, an introduction to variational methods for models with latent variables, and to deep-learning based generative models. The next chapters focus on unsupervised learning methods, for clustering, factor analysis and manifold learning. The final chapter of the book is theory-oriented and discusses concentration inequalities and generalization bounds.
Demographic parity in regression and classification within the unawareness framework
Divol, Vincent, Gaucher, Solenne
This paper explores the theoretical foundations of fair regression under the constraint of demographic parity within the unawareness framework, where disparate treatment is prohibited, extending existing results where such treatment is permitted. Specifically, we aim to characterize the optimal fair regression function when minimizing the quadratic loss. Our results reveal that this function is given by the solution to a barycenter problem with optimal transport costs. Additionally, we study the connection between optimal fair cost-sensitive classification, and optimal fair regression. We demonstrate that nestedness of the decision sets of the classifiers is both necessary and sufficient to establish a form of equivalence between classification and regression. Under this nestedness assumption, the optimal classifiers can be derived by applying thresholds to the optimal fair regression function; conversely, the optimal fair regression function is characterized by the family of cost-sensitive classifiers.
Uncertainty Quantification Using Ensemble Learning and Monte Carlo Sampling for Performance Prediction and Monitoring in Cell Culture Processes
Khuat, Thanh Tung, Bassett, Robert, Otte, Ellen, Gabrys, Bogdan
Biopharmaceutical products, particularly monoclonal antibodies (mAbs), have gained prominence in the pharmaceutical market due to their high specificity and efficacy. As these products are projected to constitute a substantial portion of global pharmaceutical sales, the application of machine learning models in mAb development and manufacturing is gaining momentum. This paper addresses the critical need for uncertainty quantification in machine learning predictions, particularly in scenarios with limited training data. Leveraging ensemble learning and Monte Carlo simulations, our proposed method generates additional input samples to enhance the robustness of the model in small training datasets. We evaluate the efficacy of our approach through two case studies: predicting antibody concentrations in advance and real-time monitoring of glucose concentrations during bioreactor runs using Raman spectra data. Our findings demonstrate the effectiveness of the proposed method in estimating the uncertainty levels associated with process performance predictions and facilitating real-time decision-making in biopharmaceutical manufacturing. This contribution not only introduces a novel approach for uncertainty quantification but also provides insights into overcoming challenges posed by small training datasets in bioprocess development. The evaluation demonstrates the effectiveness of our method in addressing key challenges related to uncertainty estimation within upstream cell cultivation, illustrating its potential impact on enhancing process control and product quality in the dynamic field of biopharmaceuticals.
Federated Prediction-Powered Inference from Decentralized Data
Luo, Ping, Deng, Xiaoge, Wen, Ziqing, Sun, Tao, Li, Dongsheng
In various domains, the increasing application of machine learning allows researchers to access inexpensive predictive data, which can be utilized as auxiliary data for statistical inference. Although such data are often unreliable compared to gold-standard datasets, Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability. However, the challenge of `data silos' arises when the private gold-standard datasets are non-shareable for model training, leading to less accurate predictive models and invalid inferences. In this paper, we introduces the Federated Prediction-Powered Inference (Fed-PPI) framework, which addresses this challenge by enabling decentralized experimental data to contribute to statistically valid conclusions without sharing private information. The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation. The proposed framework is evaluated through experiments, demonstrating its effectiveness in producing valid confidence intervals.
Pump and Dumps in the Bitcoin Era: Real Time Detection of Cryptocurrency Market Manipulations
La Morgia, Massimo, Mei, Alessandro, Sassi, Francesco, Stefa, Julinda
In the last years, cryptocurrencies are increasingly popular. Even people who are not experts have started to invest in these securities and nowadays cryptocurrency exchanges process transactions for over 100 billion US dollars per month. However, many cryptocurrencies have low liquidity and therefore they are highly prone to market manipulation schemes. In this paper, we perform an in-depth analysis of pump and dump schemes organized by communities over the Internet. We observe how these communities are organized and how they carry out the fraud. Then, we report on two case studies related to pump and dump groups. Lastly, we introduce an approach to detect the fraud in real time that outperforms the current state of the art, so to help investors stay out of the market when a pump and dump scheme is in action.
Double Machine Learning meets Panel Data -- Promises, Pitfalls, and Potential Solutions
Fuhr, Jonathan, Papies, Dominik
Estimating causal effect using machine learning (ML) algorithms can help to relax functional form assumptions if used within appropriate frameworks. However, most of these frameworks assume settings with cross-sectional data, whereas researchers often have access to panel data, which in traditional methods helps to deal with unobserved heterogeneity between units. In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved heterogeneity. This adaptation is challenging because DML's cross-fitting procedure assumes independent data and the unobserved heterogeneity is not necessarily additively separable in settings with nonlinear observed confounding. We assess the performance of several intuitively appealing estimators in a variety of simulations. While we find violations of the cross-fitting assumptions to be largely inconsequential for the accuracy of the effect estimates, many of the considered methods fail to adequately account for the presence of unobserved heterogeneity. However, we find that using predictive models based on the correlated random effects approach (Mundlak, 1978) within DML leads to accurate coefficient estimates across settings, given a sample size that is large relative to the number of observed confounders. We also show that the influence of the unobserved heterogeneity on the observed confounders plays a significant role for the performance of most alternative methods.
Global Public Sentiment on Decentralized Finance: A Spatiotemporal Analysis of Geo-tagged Tweets from 150 Countries
Chen, Yuqi, Li, Yifan, Zhou, Kyrie Zhixuan, Fu, Xiaokang, Liu, Lingbo, Bao, Shuming, Sui, Daniel, Zhang, Luyao
In the digital era, blockchain technology, cryptocurrencies, and non-fungible tokens (NFTs) have transformed financial and decentralized systems. However, existing research often neglects the spatiotemporal variations in public sentiment toward these technologies, limiting macro-level insights into their global impact. This study leverages Twitter data to explore public attention and sentiment across 150 countries, analyzing over 150 million geotagged tweets from 2012 to 2022. Sentiment scores were derived using a BERT-based multilingual sentiment model trained on 7.4 billion tweets. The analysis integrates global cryptocurrency regulations and economic indicators from the World Development Indicators database. Results reveal significant global sentiment variations influenced by economic factors, with more developed nations engaging more in discussions, while less developed countries show higher sentiment levels. Geographically weighted regression indicates that GDP-tweet engagement correlation intensifies following Bitcoin price surges. Topic modeling shows that countries within similar economic clusters share discussion trends, while different clusters focus on distinct topics. This study highlights global disparities in sentiment toward decentralized finance, shaped by economic and regional factors, with implications for poverty alleviation, cryptocurrency crime, and sustainable development. The dataset and code are publicly available on GitHub.
Identifying Factors to Help Improve Existing Decomposition-Based PMI Estimation Methods
Nau, Anna-Maria, Ditto, Phillip, Steadman, Dawnie Wolfe, Mockus, Audris
Accurately assessing the postmortem interval (PMI) is an important task in forensic science. Some of the existing techniques use regression models that use a decomposition score to predict the PMI or accumulated degree days (ADD), however, the provided formulas are based on very small samples and the accuracy is low. With the advent of Big Data, much larger samples can be used to improve PMI estimation methods. We, therefore, aim to investigate ways to improve PMI prediction accuracy by (a) using a much larger sample size, (b) employing more advanced linear models, and (c) enhancing models with factors known to affect the human decay process. Specifically, this study involved the curation of a sample of 249 human subjects from a large-scale decomposition dataset, followed by evaluating pre-existing PMI/ADD formulas and fitting increasingly sophisticated models to estimate the PMI/ADD. Results showed that including the total decomposition score (TDS), demographic factors (age, biological sex, and BMI), and weather-related factors (season of discovery, temperature history, and humidity history) increased the accuracy of the PMI/ADD models. Furthermore, the best performing PMI estimation model using the TDS, demographic, and weather-related features as predictors resulted in an adjusted R-squared of 0.34 and an RMSE of 0.95. It had a 7% lower RMSE than a model using only the TDS to predict the PMI and a 48% lower RMSE than the pre-existing PMI formula. The best ADD estimation model, also using the TDS, demographic, and weather-related features as predictors, resulted in an adjusted R-squared of 0.52 and an RMSE of 0.89. It had an 11% lower RMSE than the model using only the TDS to predict the ADD and a 52% lower RMSE than the pre-existing ADD formula. This work demonstrates the need (and way) to incorporate demographic and environmental factors into PMI/ADD estimation models.
Multi-Output Distributional Fairness via Post-Processing
Li, Gang, Lin, Qihang, Ghosh, Ayush, Yang, Tianbao
The post-processing approaches are becoming prominent techniques to enhance machine learning models' fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative density function of a model's output, which is limited to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed for extending this process to out-of-sample data. Our empirical studies, which compare our method to current existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.