AITopics | Regression

Collaborating Authors

Regression

News Overviews Instructional Materials AI-Alerts Classics

Finite-sample performance of the maximum likelihood estimator in logistic regression

Chardon, Hugo, Lerasle, Matthieu, Mourtada, Jaouad

arXiv.org Machine LearningDec-5-2024

Logistic regression is a classical model for describing the probabilistic dependence of binary responses to multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression, assessed in terms of logistic risk. We consider two questions: first, that of the existence of the MLE (which occurs when the dataset is not linearly separated), and second that of its accuracy when it exists. These properties depend on both the dimension of covariates and on the signal strength. In the case of Gaussian covariates and a well-specified logistic model, we obtain sharp non-asymptotic guarantees for the existence and excess logistic risk of the MLE. We then generalize these results in two ways: first, to non-Gaussian covariates satisfying a certain two-dimensional margin condition, and second to the general case of statistical learning with a possibly misspecified logistic model. Finally, we consider the case of a Bernoulli design, where the behavior of the MLE is highly sensitive to the parameter direction.

inequality, logistic regression, probability, (16 more...)

arXiv.org Machine Learning

2411.02137

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.90)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.84)

Add feedback

Utilizing Machine Learning Models to Predict Acute Kidney Injury in Septic Patients from MIMIC-III Database

Roknaldin, Aleyeh, Zhang, Zehao, Xu, Jiayuan, Alaei, Kamiar, Pishgar, Maryam

arXiv.org Artificial IntelligenceDec-4-2024

Sepsis is a severe condition that causes the body to respond incorrectly to an infection. This reaction can subsequently cause organ failure, a major one being acute kidney injury (AKI). For septic patients, approximately 50% develop AKI, with a mortality rate above 40%. Creating models that can accurately predict AKI based on specific qualities of septic patients is crucial for early detection and intervention. Using medical data from septic patients during intensive care unit (ICU) admission from the Medical Information Mart for Intensive Care 3 (MIMIC-III) database, we extracted 3301 patients with sepsis, with 73% of patients developing AKI. The data was randomly divided into a training set (n = 1980, 40%), a test set (n = 661, 10%), and a validation set (n = 660, 50%). The proposed model was logistic regression, and it was compared against five baseline models: XGBoost, K Nearest Neighbors (KNN), Support Vector Machines (SVM), Random Forest (RF), and LightGBM. Area Under the Curve (AUC), Accuracy, F1-Score, and Recall were calculated for each model. After analysis, we were able to select 23 features to include in our model, the top features being urine output, maximum bilirubin, minimum bilirubin, weight, maximum blood urea nitrogen, and minimum estimated glomerular filtration rate. The logistic regression model performed the best, achieving an AUC score of 0.887 (95% CI: [0.861-0.915]), an accuracy of 0.817, an F1 score of 0.866, a recall score of 0.827, and a Brier score of 0.13. Compared to the best existing literature in this field, our model achieved an 8.57% improvement in AUC while using 13 fewer variables, showcasing its effectiveness in determining AKI in septic patients. While the features selected for predicting AKI in septic patients are similar to previous literature, the top features that influenced our model's performance differ.

acute kidney injury, aki, septic patient, (14 more...)

arXiv.org Artificial Intelligence

2412.03737

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Asia > Middle East > Israel (0.04)
North America > United States > Massachusetts > Suffolk County > Boston (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Nephrology (1.00)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.79)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.54)

Add feedback

Using Platt's scaling for calibration after undersampling -- limitations and how to address them

Phelps, Nathan, Lizotte, Daniel J., Woolford, Douglas G.

arXiv.org Artificial IntelligenceDec-4-2024

When modelling data where the response is dichotomous and highly imbalanced, response-based sampling where a subset of the majority class is retained (i.e., undersampling) is often used to create more balanced training datasets prior to modelling. However, the models fit to this undersampled data, which we refer to as base models, generate predictions that are severely biased. There are several calibration methods that can be used to combat this bias, one of which is Platt's scaling. Here, a logistic regression model is used to model the relationship between the base model's original predictions and the response. Despite its popularity for calibrating models after undersampling, Platt's scaling was not designed for this purpose. Our work presents what we believe is the first detailed study focused on the validity of using Platt's scaling to calibrate models after undersampling. We show analytically, as well as via a simulation study and a case study, that Platt's scaling should not be used for calibration after undersampling without critical thought. If Platt's scaling would have been able to successfully calibrate the base model had it been trained on the entire dataset (i.e., without undersampling), then Platt's scaling might be appropriate for calibration after undersampling. If this is not the case, we recommend a modified version of Platt's scaling that fits a logistic generalized additive model to the logit of the base model's predictions, as it is both theoretically motivated and performed well across the settings considered in our study.

base model, platt, probability, (13 more...)

arXiv.org Artificial Intelligence

2410.18144

Country:

North America > Canada > Ontario > Middlesex County > London (0.14)
Europe > Austria > Vienna (0.14)
South America > Paraguay > Asunción > Asunción (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.56)

Add feedback

Bounds in Wasserstein distance for locally stationary processes

Tinio, Jan Nino G., Alaya, Mokhtar Z., Bouzebda, Salim

arXiv.org Machine LearningDec-4-2024

Locally stationary processes (LSPs) provide a robust framework for modeling time-varying phenomena, allowing for smooth variations in statistical properties such as mean and variance over time. In this paper, we address the estimation of the conditional probability distribution of LSPs using Nadaraya-Watson (NW) type estimators. The NW estimator approximates the conditional distribution of a target variable given covariates through kernel smoothing techniques. We establish the convergence rate of the NW conditional probability estimator for LSPs in the univariate setting under the Wasserstein distance and extend this analysis to the multivariate case using the sliced Wasserstein distance. Theoretical results are supported by numerical experiments on both synthetic and real-world datasets, demonstrating the practical usefulness of the proposed estimators.

estimation, estimator, wasserstein distance, (15 more...)

arXiv.org Machine Learning

2412.03414

Country:

Europe > France > Hauts-de-France > Oise > Compiègne (0.04)
Europe > Netherlands > South Holland > Delft (0.04)
South America > Chile > Araucanía Region > Malleco Province (0.04)
(6 more...)

Genre: Research Report (0.63)

Industry:

Health & Medicine (1.00)
Banking & Finance > Economy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Marginal Causal Flows for Validation and Inference

Manela, Daniel de Vassimon, Battaglia, Laura, Evans, Robin J.

arXiv.org Machine LearningDec-4-2024

Investigating the marginal causal effect of an intervention on an outcome from complex data remains challenging due to the inflexibility of employed models and the lack of complexity in causal benchmark datasets, which often fail to reproduce intricate real-world data patterns. In this paper we introduce Frugal Flows, a novel likelihood-based machine learning model that uses normalising flows to flexibly learn the data-generating process, while also directly inferring the marginal causal quantities from observational data. We propose that these models are exceptionally well suited for generating synthetic data to validate causal methods. They can create synthetic datasets that closely resemble the empirical dataset, while automatically and exactly satisfying a user-defined average treatment effect. To our knowledge, Frugal Flows are the first generative model to both learn flexible data representations and also exactly parameterise quantities such as the average treatment effect and the degree of unobserved confounding. We demonstrate the above with experiments on both simulated and real-world datasets.

covariate, dataset, frugal flow, (16 more...)

arXiv.org Machine Learning

2411.01295

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
North America > United States > Colorado > Denver County > Denver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre:

Research Report > Experimental Study (0.46)
Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Add feedback

Deep Learning, Machine Learning, Advancing Big Data Analytics and Management

Hsieh, Weiche, Bi, Ziqian, Chen, Keyu, Peng, Benji, Zhang, Sen, Xu, Jiawei, Wang, Jinlang, Yin, Caitlyn Heqi, Zhang, Yichao, Feng, Pohsun, Wen, Yizhu, Wang, Tianyang, Li, Ming, Liang, Chia Xin, Ren, Jintao, Niu, Qian, Chen, Silin, Yan, Lawrence K. Q., Xu, Han, Tseng, Hong-Ming, Song, Xinyuan, Jing, Bowen, Yang, Junjie, Song, Junhao, Liu, Junyu, Liu, Ming

arXiv.org Artificial IntelligenceDec-3-2024

Advancements in artificial intelligence, machine learning, and deep learning have catalyzed the transformation of big data analytics and management into pivotal domains for research and application. This work explores the theoretical foundations, methodological advancements, and practical implementations of these technologies, emphasizing their role in uncovering actionable insights from massive, high-dimensional datasets. The study presents a systematic overview of data preprocessing techniques, including data cleaning, normalization, integration, and dimensionality reduction, to prepare raw data for analysis. Core analytics methodologies such as classification, clustering, regression, and anomaly detection are examined, with a focus on algorithmic innovation and scalability. Furthermore, the text delves into state-of-the-art frameworks for data mining and predictive modeling, highlighting the role of neural networks, support vector machines, and ensemble methods in tackling complex analytical challenges. Special emphasis is placed on the convergence of big data with distributed computing paradigms, including cloud and edge computing, to address challenges in storage, computation, and real-time analytics. The integration of ethical considerations, including data privacy and compliance with global standards, ensures a holistic perspective on data management. Practical applications across healthcare, finance, marketing, and policy-making illustrate the real-world impact of these technologies. Through comprehensive case studies and Python-based implementations, this work equips researchers, practitioners, and data enthusiasts with the tools to navigate the complexities of modern data analytics. It bridges the gap between theory and practice, fostering the development of innovative solutions for managing and leveraging data in the era of artificial intelligence.

data mining, information retrieval, machine learning, (25 more...)

arXiv.org Artificial Intelligence

2412.02187

Country:

Europe (0.67)
Asia > China (0.45)
North America > United States > Wisconsin (0.14)
(2 more...)

Genre:

Workflow (1.00)
Overview (1.00)
Research Report > Experimental Study (0.67)

Industry:

Transportation (1.00)
Leisure & Entertainment (1.00)
Information Technology > Services (1.00)
(10 more...)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
(4 more...)

Add feedback

The Space Complexity of Approximating Logistic Loss

Dexter, Gregory, Drineas, Petros, Khanna, Rajiv

arXiv.org Artificial IntelligenceDec-3-2024

We provide space complexity lower bounds for data structures that approximate logistic loss up to $\epsilon$-relative error on a logistic regression problem with data $\mathbf{X} \in \mathbb{R}^{n \times d}$ and labels $\mathbf{y} \in \{-1,1\}^d$. The space complexity of existing coreset constructions depend on a natural complexity measure $\mu_\mathbf{y}(\mathbf{X})$, first defined in (Munteanu, 2018). We give an $\tilde{\Omega}(\frac{d}{\epsilon^2})$ space complexity lower bound in the regime $\mu_\mathbf{y}(\mathbf{X}) = O(1)$ that shows existing coresets are optimal in this regime up to lower order factors. We also prove a general $\tilde{\Omega}(d\cdot \mu_\mathbf{y}(\mathbf{X}))$ space lower bound when $\epsilon$ is constant, showing that the dependency on $\mu_\mathbf{y}(\mathbf{X})$ is not an artifact of mergeable coresets. Finally, we refute a prior conjecture that $\mu_\mathbf{y}(\mathbf{X})$ is hard to compute by providing an efficient linear programming formulation, and we empirically compare our algorithm to prior approximate methods.

approximation, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2412.02639

Country: North America > United States > Indiana (0.04)

Genre: Research Report > New Finding (0.89)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)

Add feedback

Private Linear Regression with Differential Privacy and PAC Privacy

Yang, Hillary

arXiv.org Artificial IntelligenceDec-3-2024

Linear regression is a fundamental tool for statistical analysis, which has motivated the development of linear regression methods that satisfy provable privacy guarantees so that the learned model reveals little about any one data point used to construct it. Most existing privacy-preserving linear regression methods rely on the well-established framework of differential privacy, while the newly proposed PAC Privacy has not yet been explored in this context. In this paper, we systematically compare linear regression models trained with differential privacy and PAC privacy across three real-world datasets, observing several key findings that impact the performance of privacy-preserving linear regression.

linear regression, privacy, regression, (14 more...)

arXiv.org Artificial Intelligence

2412.02578

Country:

North America > United States > Indiana (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)

Genre: Research Report (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

Comparative Performance of Machine Learning Algorithms for Early Genetic Disorder and Subclass Classification

Siddik, Abu Bakar, Badal, Faisal R., Islam, Afroza

arXiv.org Artificial IntelligenceDec-3-2024

A great deal of effort has been devoted to discovering a particular genetic disorder, but its classification across a broad spectrum of disorder classes and types remains elusive. Early diagnosis of genetic disorders enables timely interventions and improves outcomes. This study implements machine learning models using basic clinical indicators measurable at birth or infancy to enable diagnosis in preliminary life stages. Supervised learning algorithms were implemented on a dataset of 22083 instances with 42 features like family history, newborn metrics, and basic lab tests. Extensive hyperparameter tuning, feature engineering, and selection were undertaken. Two multi-class classifiers were developed: one for predicting disorder classes (mitochondrial, multifactorial, and single-gene) and one for subtypes (9 disorders). Performance was evaluated using accuracy, precision, recall, and the F1-score. The CatBoost classifier achieved the highest accuracy of 77% for predicting genetic disorder classes. For subtypes, SVM attained a maximum accuracy of 80%. The study demonstrates the feasibility of using basic clinical data in machine learning models for early categorization and diagnosis across various genetic disorders. Applying ML with basic clinical indicators can enable timely interventions once validated on larger datasets. It is necessary to conduct further studies to improve model performance on this dataset.

algorithm, classifier, disorder, (17 more...)

arXiv.org Artificial Intelligence

2412.02189

Country: Asia > Bangladesh (0.04)

Genre: Research Report > Experimental Study (0.90)

Industry: Health & Medicine > Therapeutic Area > Genetic Disease (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.70)

Add feedback

Mining Tweets to Predict Future Bitcoin Price

Hathidara, Ashutosh, Atavale, Gaurav, Chaudhary, Suyash

arXiv.org Artificial IntelligenceDec-2-2024

Bitcoin has increased investment interests in people during the last decade. We have seen an increase in the number of posts on social media platforms about cryptocurrency, especially Bitcoin. This project focuses on analyzing user tweet data in combination with Bitcoin price data to see the relevance between price fluctuations and the conversation between millions of people on Twitter. This study also exploits this relationship between user tweets and bitcoin prices to predict the future bitcoin price. We are utilizing novel techniques and methods to analyze the data and make price predictions.

artificial intelligence, machine learning, tweet, (12 more...)

arXiv.org Artificial Intelligence

2412.02148

Country: North America > United States > Indiana (0.05)

Genre: Research Report (1.00)

Industry: Banking & Finance > Trading (1.00)

Technology:

Information Technology > e-Commerce > Financial Technology (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)

Add feedback