Regression
Finite-sample performance of the maximum likelihood estimator in logistic regression
Chardon, Hugo, Lerasle, Matthieu, Mourtada, Jaouad
Logistic regression is a classical model for describing the probabilistic dependence of binary responses to multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second that of its accuracy when it exists. These properties depend on both the dimension of covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.
Utilizing Machine Learning Models to Predict Acute Kidney Injury in Septic Patients from MIMIC-III Database
Roknaldin, Aleyeh, Zhang, Zehao, Xu, Jiayuan, Alaei, Kamiar, Pishgar, Maryam
Sepsis is a severe condition that causes the body to respond incorrectly to an infection. This reaction can subsequently cause organ failure, a major one being acute kidney injury (AKI). For septic patients, approximately 50% develop AKI, with a mortality rate above 40%. Creating models that can accurately predict AKI based on specific qualities of septic patients is crucial for early detection and intervention. Using medical data from septic patients during intensive care unit (ICU) admission from the Medical Information Mart for Intensive Care 3 (MIMIC-III) database, we extracted 3301 patients with sepsis, with 73% of patients developing AKI. The data was randomly divided into a training set (n = 1980, 40%), a test set (n = 661, 10%), and a validation set (n = 660, 50%). The proposed model was logistic regression, and it was compared against five baseline models: XGBoost, K Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), and LightGBM. Area Under the Curve (AUC), Accuracy, F1-Score, and Recall were calculated for each model. After analysis, we were able to select 23 features to include in our model, the top features being urine output, maximum bilirubin, minimum bilirubin, weight, maximum blood urea nitrogen, and minimum estimated glomerular filtration rate. The logistic regression model performed the best, achieving an AUC score of 0.887 (95% CI: [0.861-0.915]), an accuracy of 0.817, an F1 score of 0.866, a recall score of 0.827, and a Brier score of 0.13. Compared to the best existing literature in this field, our model achieved an 8.57% improvement in AUC while using 13 fewer variables, showcasing its effectiveness in determining AKI in septic patients. While the features selected for predicting AKI in septic patients are similar to previous literature, the top features that influenced our model's performance differ.
Using Platt's scaling for calibration after undersampling -- limitations and how to address them
Phelps, Nathan, Lizotte, Daniel J., Woolford, Douglas G.
When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study.
Bounds in Wasserstein distance for locally stationary processes
Tinio, Jan Nino G., Alaya, Mokhtar Z., Bouzebda, Salim
Locally stationary processes (LSPs) provide a robust framework for modeling time-varying phenomena, allowing for smooth variations in statistical properties such as mean and variance over time. In this paper, we address the estimation of the conditional probability distribution of LSPs using Nadaraya-Watson (NW) type estimators. The NW estimator approximates the conditional distribution of a target variable given covariates through kernel smoothing techniques. We establish the convergence rate of the NW conditional probability estimator for LSPs in the univariate setting under the Wasserstein distance and extend this analysis to the multivariate case using the sliced Wasserstein distance. Theoretical results are supported by numerical experiments on both synthetic and real-world datasets, demonstrating the practical usefulness of the proposed estimators.
Marginal Causal Flows for Validation and Inference
Manela, Daniel de Vassimon, Battaglia, Laura, Evans, Robin J.
Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets.
Deep Learning, Machine Learning, Advancing Big Data Analytics and Management
Hsieh, Weiche, Bi, Ziqian, Chen, Keyu, Peng, Benji, Zhang, Sen, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Liang, Chia Xin, Ren, Jintao, Niu, Qian, Chen, Silin, Yan, Lawrence K. Q., Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Jing, Bowen, Yang, Junjie, Song, Junhao, Liu, Junyu, Liu, Ming
Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.
The Space Complexity of Approximating Logistic Loss
Dexter, Gregory, Drineas, Petros, Khanna, Rajiv
We provide space complexity lower bounds for data structures that approximate logistic loss up to $\epsilon$-relative error on a logistic regression problem with data $\mathbf{X} \in \mathbb{R}^{n \times d}$ and labels $\mathbf{y} \in \{-1,1\}^d$. The space complexity of existing coreset constructions depend on a natural complexity measure $\mu_\mathbf{y}(\mathbf{X})$, first defined in (Munteanu, 2018). We give an $\tilde{\Omega}(\frac{d}{\epsilon^2})$ space complexity lower bound in the regime $\mu_\mathbf{y}(\mathbf{X}) = O(1)$ that shows existing coresets are optimal in this regime up to lower order factors. We also prove a general $\tilde{\Omega}(d\cdot \mu_\mathbf{y}(\mathbf{X}))$ space lower bound when $\epsilon$ is constant, showing that the dependency on $\mu_\mathbf{y}(\mathbf{X})$ is not an artifact of mergeable coresets. Finally, we refute a prior conjecture that $\mu_\mathbf{y}(\mathbf{X})$ is hard to compute by providing an efficient linear programming formulation, and we empirically compare our algorithm to prior approximate methods.
Private Linear Regression with Differential Privacy and PAC Privacy
Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.
Comparative Performance of Machine Learning Algorithms for Early Genetic Disorder and Subclass Classification
Siddik, Abu Bakar, Badal, Faisal R., Islam, Afroza
A great deal of effort has been devoted to discovering a particular genetic disorder, but its classification across a broad spectrum of disorder classes and types remains elusive. Early diagnosis of genetic disorders enables timely interventions and improves outcomes. This study implements machine learning models using basic clinical indicators measurable at birth or infancy to enable diagnosis in preliminary life stages. Supervised learning algorithms were implemented on a dataset of 22083 instances with 42 features like family history, newborn metrics, and basic lab tests. Extensive hyperparameter tuning, feature engineering, and selection were undertaken. Two multi-class classifiers were developed: one for predicting disorder classes (mitochondrial, multifactorial, and single-gene) and one for subtypes (9 disorders). Performance was evaluated using accuracy, precision, recall, and the F1-score. The CatBoost classifier achieved the highest accuracy of 77% for predicting genetic disorder classes. For subtypes, SVM attained a maximum accuracy of 80%. The study demonstrates the feasibility of using basic clinical data in machine learning models for early categorization and diagnosis across various genetic disorders. Applying ML with basic clinical indicators can enable timely interventions once validated on larger datasets. It is necessary to conduct further studies to improve model performance on this dataset.
Mining Tweets to Predict Future Bitcoin Price
Hathidara, Ashutosh, Atavale, Gaurav, Chaudhary, Suyash
Bitcoin has increased investment interests in people during the last decade. We have seen an increase in the number of posts on social media platforms about cryptocurrency, especially Bitcoin. This project focuses on analyzing user tweet data in combination with Bitcoin price data to see the relevance between price fluctuations and the conversation between millions of people on Twitter. This study also exploits this relationship between user tweets and bitcoin prices to predict the future bitcoin price. We are utilizing novel techniques and methods to analyze the data and make price predictions.