 Regression


Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

arXiv.org Machine Learning

Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.
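To make the setting concrete, here is a minimal sketch of one-pass online SGD on squared loss with an EMA of the iterates. The step size `lr`, decay `beta`, zero initialization of the average, and the synthetic Gaussian data are all illustrative assumptions, not the paper's exact setup.

```python
import random

def sgd_with_ema(samples, dim, lr=0.05, beta=0.9):
    """One pass of online SGD on squared loss, tracking an EMA of the iterates."""
    w = [0.0] * dim       # current SGD iterate
    w_ema = [0.0] * dim   # EMA of iterates (zero init is an assumption)
    for x, y in samples:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Stochastic gradient step on 0.5 * (w.x - y)^2
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        # EMA update: old average decays by beta, new iterate enters with 1 - beta
        w_ema = [beta * ei + (1.0 - beta) * wi for ei, wi in zip(w_ema, w)]
    return w, w_ema

def make_samples(n, w_star, noise, rng):
    """Synthetic linear-regression data with standard Gaussian covariates."""
    samples = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_star]
        y = sum(wi * xi for wi, xi in zip(w_star, x)) + rng.gauss(0.0, noise)
        samples.append((x, y))
    return samples

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

rng = random.Random(0)
w_star = [1.0, -2.0, 0.5]  # hypothetical ground-truth weights
samples = make_samples(2000, w_star, noise=0.5, rng=rng)
w_last, w_ema = sgd_with_ema(samples, dim=3)
print("last iterate error:", sq_dist(w_last, w_star))
print("EMA error:         ", sq_dist(w_ema, w_star))
```

Comparing the two printed errors over several seeds gives a rough empirical feel for the variance claim; a single run is not evidence either way.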


OGBoost: A Python Package for Ordinal Gradient Boosting

arXiv.org Machine Learning

This paper introduces OGBoost, a scikit-learn-compatible Python package for ordinal regression using gradient boosting. Ordinal variables (e.g., rating scales, quality assessments) lie between nominal and continuous data, necessitating specialized methods that reflect their inherent ordering. Built on a coordinate-descent approach for optimization and the latent-variable framework for ordinal regression, OGBoost performs joint optimization of a latent continuous regression function (functional gradient descent) and a threshold vector that converts the latent continuous value into discrete class probabilities (classical gradient descent). In addition to the standard methods for scikit-learn classifiers, the GradientBoostingOrdinal class implements a "decision_function" that returns the (scalar) value of the latent function for each observation, which can be used as a high-resolution alternative to class labels for comparing and ranking observations. The class has the option to use cross-validation for early stopping rather than a single holdout validation set, a more robust approach for small and/or imbalanced datasets. Furthermore, users can select base learners with different underlying algorithms and/or hyperparameters for use throughout the boosting iterations, resulting in a "heterogeneous" ensemble approach that can be used as a more efficient alternative to hyperparameter tuning (e.g., via grid search). We illustrate the capabilities of OGBoost through examples, using the wine quality dataset from the UCI repository. The package is available on PyPI and can be installed via "pip install ogboost".
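The latent-variable idea can be sketched without the package itself. The snippet below assumes a logistic link and hand-picked thresholds; it does not call OGBoost, and the function names are hypothetical. It shows only how a scalar latent score and an ordered threshold vector yield class probabilities.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_class_probs(latent, thresholds):
    """Map a latent score to class probabilities via ordered thresholds.

    Assumes P(y <= k) = sigmoid(thresholds[k] - latent) (logistic link),
    with thresholds sorted in increasing order.
    """
    cdf = [sigmoid(t - latent) for t in thresholds] + [1.0]
    probs = []
    prev = 0.0
    for c in cdf:
        probs.append(c - prev)  # P(y = k) = P(y <= k) - P(y <= k - 1)
        prev = c
    return probs

thresholds = [-1.0, 0.5, 2.0]  # illustrative values: 3 thresholds, 4 classes
for latent in (-2.0, 0.0, 3.0):
    probs = ordinal_class_probs(latent, thresholds)
    print(latent, [round(p, 3) for p in probs])
```

Because the latent score is a single number per observation, it orders observations more finely than the discrete class labels do, which is the role the abstract assigns to "decision_function".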


Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

arXiv.org Machine Learning

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
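The norm divergence described above is easy to observe on a toy separable dataset. The sketch below runs full-batch GD on the logistic loss and records the iterate norm; the four data points and the step size are illustrative assumptions, and stopping early simply means returning the iterate after fewer steps.

```python
import math

def logistic_gd(data, steps, lr=1.0):
    """Full-batch gradient descent on the average logistic loss.

    data: list of (x, y) pairs with y in {-1, +1}.
    Returns the final iterate and the norm of every iterate along the way.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    norms = []
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in data:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin))
            coef = -y / (1.0 + math.exp(margin))
            for j, xj in enumerate(x):
                grad[j] += coef * xj / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        norms.append(math.sqrt(sum(wi * wi for wi in w)))
    return w, norms

# Linearly separable toy data (hypothetical), symmetric about the line x1 = x2
data = [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([-2.0, -1.0], -1), ([-1.0, -2.0], -1)]
w, norms = logistic_gd(data, steps=1000)
print("norm after 10, 100, 1000 steps:", norms[9], norms[99], norms[999])
print("direction:", [wi / norms[-1] for wi in w])
```

On this data the norm keeps growing with the number of steps while the direction settles, so an early-stopped iterate has the same direction trend but a bounded norm.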


Discovering the influence of personal features in psychological processes using Artificial Intelligence techniques: the case of COVID19 lockdown in Spain

arXiv.org Artificial Intelligence

At the end of 2019, an outbreak of a novel coronavirus was reported in China, leading to the COVID-19 pandemic. In Spain, the first cases were detected in late January 2020, and by mid-March, infections had surpassed 5,000. In March, the Spanish government started a nationwide lockdown to contain the spread of the virus. While isolation measures were necessary, they posed significant psychological and socioeconomic challenges, particularly for vulnerable populations. Understanding the psychological impact of lockdown and the factors influencing mental health is crucial for informing future public health policies. This study analyzes the influence of personal, socioeconomic, general health and living condition factors on psychological states during lockdown using AI techniques. A dataset collected through an online questionnaire was processed using two workflows, each structured into three stages. First, individuals were categorized based on psychological assessments, either directly or in combination with unsupervised learning techniques. Second, various Machine Learning classifiers were trained to distinguish between the identified groups. Finally, feature importance analysis was conducted to identify the most influential variables related to different psychological conditions. The evaluated models demonstrated strong performance, with accuracy exceeding 80% and often surpassing 90%, particularly for Random Forest, Decision Trees, and Support Vector Machines. Sensitivity and specificity analyses revealed that models performed well across different psychological conditions, with the health impacts subset showing the highest reliability. For diagnosing vulnerability, models achieved over 90% accuracy, except for less vulnerable individuals using living environment and economic status features, where performance was slightly lower.


Task Shift: From Classification to Regression in Overparameterized Linear Models

arXiv.org Machine Learning

Modern machine learning methods have recently demonstrated remarkable capability to generalize under task shift, where latent knowledge is transferred to a different, often more difficult, task under a similar data distribution. We investigate this phenomenon in an overparameterized linear regression setting where the task shifts from classification during training to regression during evaluation. In the zero-shot case, wherein no regression data is available, we prove that task shift is impossible in both sparse signal and random signal models for any Gaussian covariate distribution. In the few-shot case, wherein limited regression data is available, we propose a simple postprocessing algorithm which asymptotically recovers the ground-truth predictor. Our analysis leverages a fine-grained characterization of individual parameters arising from minimum-norm interpolation which may be of independent interest. Our results show that while minimum-norm interpolators for classification cannot transfer to regression a priori, they experience surprisingly structured attenuation which enables successful task shift with limited additional data.
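For readers unfamiliar with minimum-norm interpolation, the following sketch computes the minimum $\ell_2$-norm weight vector that fits binary labels exactly when there are more features than samples, using $w = X^\top (X X^\top)^{-1} y$. The tiny dataset and helper names are hypothetical, and the sketch does not reproduce the paper's postprocessing algorithm.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def min_norm_interpolator(X, y):
    """Minimum l2-norm w with X w = y, assuming X has full row rank."""
    gram = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    alpha = solve(gram, y)  # alpha = (X X^T)^{-1} y
    dim = len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(dim)]

# Two samples, four features (overparameterized), labels in {-1, +1}
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
y = [1.0, -1.0]
w = min_norm_interpolator(X, y)
print(w)  # weights on unused features stay at zero
```

The interpolator fits the training labels exactly while placing no weight on directions the data never touches, which is the sense in which it is "minimum-norm".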


The Relationship Between Head Injury and Alzheimer's Disease: A Causal Analysis with Bayesian Networks

arXiv.org Artificial Intelligence

This study examines the potential causal relationship between head injury and the risk of developing Alzheimer's disease (AD) using Bayesian networks and regression models. Using a dataset of 2,149 patients, we analyze key medical history variables, including head injury history, memory complaints, cardiovascular disease, and diabetes. Logistic regression results suggest an odds ratio of 0.88 for head injury, indicating a potential but statistically insignificant protective effect against AD. In contrast, memory complaints exhibit a strong association with AD, with an odds ratio of 4.59. Linear regression analysis further confirms the lack of statistical significance for head injury (coefficient: -0.0245, p = 0.469) while reinforcing the predictive importance of memory complaints. These findings highlight the complex interplay of medical history factors in AD risk assessment and underscore the need for further research utilizing larger datasets and advanced causal modeling techniques.
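As a reading aid for the reported odds ratios: in logistic regression the odds ratio of a binary predictor is the exponential of its coefficient, so the coefficients can be recovered with a logarithm. The sketch below uses only the two odds ratios quoted above; the baseline probability of 0.5 is a hypothetical value chosen for illustration.

```python
import math

def log_odds_coefficient(odds_ratio):
    """Logistic-regression coefficient implied by an odds ratio (OR = exp(beta))."""
    return math.log(odds_ratio)

def apply_odds_ratio(prob, odds_ratio):
    """Probability after multiplying the odds of `prob` by `odds_ratio`."""
    odds = prob / (1.0 - prob) * odds_ratio
    return odds / (1.0 + odds)

beta_head_injury = log_odds_coefficient(0.88)  # about -0.128
beta_memory = log_odds_coefficient(4.59)       # about 1.524

baseline = 0.5  # hypothetical baseline probability, for illustration only
print(round(beta_head_injury, 3), round(beta_memory, 3))
print(round(apply_odds_ratio(baseline, 0.88), 3))
print(round(apply_odds_ratio(baseline, 4.59), 3))
```

An odds ratio below 1 corresponds to a negative coefficient, which is why 0.88 reads as a (statistically insignificant) protective direction while 4.59 reads as a strong positive association.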


Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport

arXiv.org Machine Learning

An anomaly, or an outlier, is a data point that is significantly different from the remaining data [Aggarwal, 2017], to such an extent that it was likely generated by a different mechanism [Hawkins, 1980]. From the perspective of machine learning, Anomaly Detection (AD) aims to determine, from a set of examples, which ones are likely anomalies, typically through a score. This problem finds applications in many different fields, such as medicine [Salem et al., 2013], cyber-security [Siddiqui et al., 2019], and system monitoring [Isermann, 2006], to name a few. As reviewed in Han et al. [2022], existing techniques for AD are usually divided into unsupervised, semi-supervised and supervised approaches, with an increasing need for labeled data. In this paper, we focus on unsupervised AD, which does not need further labeling effort in constituting datasets. As discussed in Livernoche et al. [2024], the growing number of applications involving high-dimensional and complex data calls for non-parametric algorithms.


Asymptotic Optimism of Random-Design Linear and Kernel Regression Models

arXiv.org Machine Learning

We derive the closed-form asymptotic optimism of linear regression models under random designs and generalize it to kernel ridge regression. Using scaled asymptotic optimism as a generic predictive model complexity measure, we study the fundamentally different behaviors of the linear regression model, the neural tangent kernel (NTK) regression model, and three-layer fully connected neural networks (NN). Our contribution is two-fold: we provide theoretical grounds for using scaled optimism as a predictive model complexity measure, and we show empirically that NNs with ReLUs behave differently from kernel models under this measure. With resampling techniques, we can also compute the optimism for regression models with real data.


How does ion temperature gradient turbulence depend on magnetic geometry? Insights from data and machine learning

arXiv.org Artificial Intelligence

Magnetic geometry has a significant effect on the level of turbulent transport in fusion plasmas. Here, we model and analyze this dependence using multiple machine learning methods and a dataset of > 200,000 nonlinear simulations of ion-temperature-gradient turbulence in diverse non-axisymmetric geometries. The dataset is generated using a large collection of both optimized and randomly generated stellarator equilibria. At fixed gradients, the turbulent heat flux varies between geometries by several orders of magnitude. Trends are apparent among the configurations with particularly high or low heat flux. Regression and classification techniques from machine learning are then applied to extract patterns in the dataset. Due to a symmetry of the gyrokinetic equation, the heat flux and regressions thereof should be invariant to translations of the raw features in the parallel coordinate, similar to translation invariance in computer vision applications. Multiple regression models including convolutional neural networks (CNNs) and decision trees can achieve reasonable predictive power for the heat flux in held-out test configurations, with highest accuracy for the CNNs. Using Spearman correlation, sequential feature selection, and Shapley values to measure feature importance, it is consistently found that the most important geometric lever on the heat flux is the flux surface compression in regions of bad curvature. The second most important feature relates to the magnitude of geodesic curvature. These two features align remarkably with surrogates that have been proposed based on theory, while the methods here allow a natural extension to more features for increased accuracy. The dataset, released with this publication, may also be used to test other proposed surrogates, and we find many previously published proxies do correlate well with both the heat flux and stability boundary.
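One of the feature-importance measures mentioned above, Spearman correlation, is simple enough to sketch directly. The version below assumes no tied values, and the feature and heat-flux numbers are made up for illustration; they are not taken from the released dataset.

```python
def ranks(values):
    """Rank positions (0-based) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical geometric feature and heat flux for five configurations
feature = [0.1, 0.4, 0.2, 0.9, 0.7]
heat_flux = [1.0, 30.0, 5.0, 900.0, 200.0]
print(spearman(feature, heat_flux))
```

Because it depends only on ranks, the measure is unaffected by the orders-of-magnitude spread in heat flux that the abstract describes.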


Text Classification in the LLM Era - Where do we stand?

arXiv.org Artificial Intelligence

Large Language Models have revolutionized NLP and shown dramatic performance improvements across several tasks. In this paper, we investigate the role of such language models in text classification and how they compare with other approaches relying on smaller pre-trained language models. Considering 32 datasets spanning 8 languages, we compare zero-shot classification, few-shot fine-tuning, and synthetic-data-based classifiers with classifiers built using the complete human-labeled dataset. Our results show that zero-shot approaches do well for sentiment classification but are outperformed by other approaches for the rest of the tasks, and that synthetic data sourced from multiple LLMs can build better classifiers than zero-shot open LLMs. We also see wide performance disparities across languages in all the classification scenarios. We expect that these findings will guide practitioners developing text classification systems across languages.