 Regression


Scaling Law for Stochastic Gradient Descent in Quadratically Parameterized Linear Regression

arXiv.org Artificial Intelligence

In machine learning, a scaling law describes how model performance improves as model and data size grow. From a learning theory perspective, this class of results establishes upper and lower generalization bounds for a specific learning algorithm. Here, the particular algorithm, run with a specific model parameterization, often provides a crucial implicit regularization effect that leads to good generalization. To characterize scaling laws, previous theoretical studies focus mainly on linear models, whereas feature learning, a notable process that contributes to the remarkable empirical success of neural networks, is largely absent from them. This paper studies the scaling law for linear regression with a quadratically parameterized model. We consider infinite-dimensional data and a ground-truth slope, both of which exhibit certain power-law decay rates. We study convergence rates for Stochastic Gradient Descent (SGD) and demonstrate that the learning rates for the variables automatically adapt to the ground truth. As a result, in canonical linear regression, we provide explicit separations between the generalization curves of SGD with and without feature learning, and the information-theoretic lower bound that is agnostic to the parametrization method and the algorithm. Our analysis for decaying ground truth provides a new characterization of the learning dynamics of the model.
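
The setting described in this abstract can be illustrated with a small simulation. The sketch below is not the paper's construction: it assumes a finite dimension, the parameterization w_i = v_i ** 2, and illustrative power-law exponents (1.5 for the covariance spectrum, 1.0 for the ground truth), none of which are taken from the source. The function name `run_sgd` and all of its parameters are hypothetical.

```python
import random


def run_sgd(dim=10, steps=20000, lr=0.01, noise_std=0.1, seed=0):
    """Online SGD on a linear model whose weights are parameterized as v_i ** 2."""
    rng = random.Random(seed)
    # Assumed power-law spectra (illustrative exponents, not from the paper).
    lam = [(i + 1) ** -1.5 for i in range(dim)]     # covariance eigenvalues
    w_star = [(i + 1) ** -1.0 for i in range(dim)]  # nonnegative ground truth
    v = [0.5] * dim                                 # effective weights are v_i ** 2

    def excess_risk(params):
        # E[<w - w_star, x>^2] for independent coordinates with variances lam.
        return sum(l * (p * p - ws) ** 2 for l, p, ws in zip(lam, params, w_star))

    initial = excess_risk(v)
    for _ in range(steps):
        # One fresh sample per step: x_i ~ N(0, lam_i), y = <w_star, x> + noise.
        x = [rng.gauss(0.0, l ** 0.5) for l in lam]
        y = sum(ws * xi for ws, xi in zip(w_star, x)) + rng.gauss(0.0, noise_std)
        resid = sum(p * p * xi for p, xi in zip(v, x)) - y
        # Gradient of 0.5 * resid ** 2 with respect to v_i is resid * 2 * v_i * x_i.
        v = [p - lr * resid * 2.0 * p * xi for p, xi in zip(v, x)]
    return initial, excess_risk(v)


if __name__ == "__main__":
    before, after = run_sgd()
    print(f"excess risk before: {before:.4f}, after: {after:.6f}")
```

Because the gradient with respect to v_i carries a factor of v_i, the effective step size on each weight scales with the current magnitude of that coordinate, which is one simple way to see how a per-coordinate learning rate can adapt to the signal.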


High-dimensional censored MIDAS logistic regression for corporate survival forecasting

arXiv.org Machine Learning

This paper addresses the challenge of forecasting corporate distress, a problem marked by three key statistical hurdles: (i) right censoring, (ii) high-dimensional predictors, and (iii) mixed-frequency data. To overcome these complexities, we introduce a novel high-dimensional censored MIDAS (Mixed Data Sampling) logistic regression. Our approach handles censoring through inverse probability weighting and achieves accurate estimation with numerous mixed-frequency predictors by employing a sparse-group penalty. We establish finite-sample bounds for the estimation error, accounting for censoring, the MIDAS approximation error, and heavy tails. The superior performance of the method is demonstrated through Monte Carlo simulations. Finally, we present an extensive application of our methodology to predict the financial distress of Chinese-listed firms. Our novel procedure is implemented in the R package 'Survivalml'.
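
As a rough illustration of the sparse-group penalty mentioned above, the sketch below evaluates one common form of it: a convex combination of an L1 norm and a sum of unweighted group L2 norms. The abstract does not state the exact form used, so the mixing parameter `alpha`, the absence of group-size weights, and the function name `sparse_group_penalty` are assumptions, and this is not the API of the 'Survivalml' package.

```python
import math


def sparse_group_penalty(beta, groups, lam=1.0, alpha=0.5):
    """lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g ||beta_g||_2)."""
    # L1 part encourages sparsity of individual coefficients.
    l1 = sum(abs(b) for b in beta)
    # Group part encourages whole groups (e.g. all lags of one predictor) to vanish.
    group_l2 = sum(math.sqrt(sum(beta[j] ** 2 for j in g)) for g in groups)
    return lam * (alpha * l1 + (1.0 - alpha) * group_l2)


if __name__ == "__main__":
    # Two groups of two coefficients each; the second group is entirely zero.
    print(sparse_group_penalty([3.0, -4.0, 0.0, 0.0], [[0, 1], [2, 3]]))
```

In a mixed-frequency setting, a natural grouping would put all coefficients belonging to one high-frequency predictor in the same group, though how the groups are formed here is left to the caller.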


Using Artificial Intelligence to Improve Classroom Learning Experience

arXiv.org Artificial Intelligence

Shadeeb Hossain, Engineering Technology and Information Sciences, DeVry University, New York, USA [ORCID ID: 0000-0002-5224-7684]. Abstract: This paper explores advancements in Artificial Intelligence (AI) technologies to enhance classroom learning, highlighting contributions from companies like IBM, Microsoft, Google, and ChatGPT, as well as the potential of brain signal analysis. The focus is on improving students' learning experiences by using Machine Learning (ML) algorithms to (i) identify a student's preferred learning style (visual or auditory) and (ii) predict academic dropout risk. A Logistic Regression algorithm is applied for binary classification using six predictor variables, such as assessment scores, lesson duration, and preferred learning style, to accurately identify learning preferences. In comparison, the Stochastic Gradient Descent (SGD) classifier achieved an accuracy of 83.1% on the same dataset. Individual feedback to students and customized learning materials have a significant impact on their learning ability and have been areas of active research focus [1]. However, in the United States, due to the vast diversity in classroom populations, it becomes inherently difficult for educators to customize lessons and address individual students' problems [2]. Various factors contribute to the effectiveness of individual learning processes [3,4]. Questionnaires have often been used as a tool to predict an individual's learning style [5-8]. Learning analytics, which involves the collection, analysis, and use of data, has been suggested to improve students' learning experiences [9]. In most cases, these assessments have been used to generalize the overall learning patterns of a classroom rather than addressing the needs of individual students. The concept of a SMART classroom incorporates both hardware and software components to adapt to dynamic learning patterns in a classroom, and it has been an area of ongoing research [10,11].


Zero-shot Concept Bottleneck Models

arXiv.org Artificial Intelligence

Concept bottleneck models (CBMs) are inherently interpretable and intervenable neural network models, which explain their final label prediction by the intermediate prediction of high-level semantic concepts. However, they require target task training to learn input-to-concept and concept-to-label mappings, which incurs the cost of collecting target datasets and of training resources. In this paper, we present zero-shot concept bottleneck models (Z-CBMs), which predict concepts and labels in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of vocabulary items extracted from the web, to describe arbitrary input in various domains. For the input-to-concept mapping, we introduce concept retrieval, which dynamically finds input-related concepts by cross-modal search on the concept bank. In the concept-to-label inference, we apply concept regression to select essential concepts from the retrieved concepts by sparse linear regression. Through extensive experiments, we confirm that our Z-CBMs provide interpretable and intervenable concepts without any additional training. Code will be available at https://github.com/yshinya6/zcbm.
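
To make the idea of selecting concepts by sparse linear regression concrete, here is a minimal lasso solver based on iterative soft-thresholding (ISTA). The abstract does not say which solver or objective Z-CBMs use, so the lasso objective, the ISTA method, and the names `soft_threshold` and `ista_lasso` are stand-ins chosen for illustration. Under that assumption, the columns of `A` would play the role of retrieved concept features and `b` the target to be explained.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ista_lasso(A, b, lam, iters=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by ISTA (A is a list of rows)."""
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of A^T A,
    # so 1 / that value is a safe (if conservative) step size.
    lipschitz_bound = sum(a * a for row in A for a in row)
    step = 1.0 / lipschitz_bound
    x = [0.0] * n_cols
    for _ in range(iters):
        resid = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i] for i in range(n_rows)]
        grad = [sum(A[i][j] * resid[i] for i in range(n_rows)) for j in range(n_cols)]
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n_cols)]
    return x


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # With an identity design, the lasso solution is the soft-thresholded target.
    print(ista_lasso(identity, [3.0, 0.5, -2.0], lam=1.0))
```

Coefficients driven exactly to zero correspond to concepts that are dropped, which is what makes the remaining nonzero coefficients readable as an explanation.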


Meta-learning of shared linear representations beyond well-specified linear regression

arXiv.org Machine Learning

Motivated by multi-task and meta-learning approaches, we consider the problem of learning structure shared by tasks or users, such as shared low-rank representations or clustered structures. While all previous works focus on well-specified linear regression, we consider more general convex objectives, where the structural low-rank and cluster assumptions are expressed on the optima of each function. We show that under mild assumptions such as Hessian concentration and noise concentration at the optimum, rank and clustered regularized estimators recover such structure, provided the number of samples per task and the number of tasks are large enough. We then study the problem of recovering the subspace in which all the solutions lie, in the setting where there is only a single sample per task: we show that in that case, the rank-constrained estimator can recover the subspace, but that the number of tasks needs to scale exponentially large with the dimension of the subspace. Finally, we provide a polynomial-time algorithm via nuclear norm constraints for learning a shared linear representation in the context of convex learning objectives.


Optimal Algorithms in Linear Regression under Covariate Shift: On the Importance of Precondition

arXiv.org Machine Learning

A common pursuit in modern statistical learning is to attain satisfactory generalization outside the source data distribution (OOD). In theory, the challenge remains unsolved even under the canonical setting of covariate shift for the linear model. This paper studies the foundational (high-dimensional) linear regression where the ground truth variables are confined to an ellipse-shaped constraint and addresses two fundamental questions in this regime: (i) given the target covariate matrix, what is the min-max optimal algorithm under covariate shift? (ii) for what kinds of target classes do the commonly used SGD-type algorithms achieve optimality? Our analysis starts with establishing a tight lower generalization bound via a Bayesian Cramér-Rao inequality. For (i), we prove that the optimal estimator can be simply a certain linear transformation of the best estimator for the source distribution. Given the source and target matrices, we show that the transformation can be efficiently computed via a convex program. The min-max optimal analysis for SGD leverages the idea that we recognize both the accumulated updates of the applied algorithms and the ideal transformation as preconditions on the learning variables. We provide sufficient conditions under which SGD and its accelerated variants attain optimality.


Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion

arXiv.org Artificial Intelligence

The growing availability of longitudinal Magnetic Resonance Imaging (MRI) datasets has facilitated Artificial Intelligence (AI)-driven modeling of disease progression, making it possible to predict future medical scans for individual patients. However, despite significant advancements in AI, current methods continue to face challenges including achieving patient-specific individualization, ensuring spatiotemporal consistency, efficiently utilizing longitudinal data, and managing the substantial memory demands of 3D scans. To address these challenges, we propose Brain Latent Progression (BrLP), a novel spatiotemporal model designed to predict individual-level disease progression in 3D brain MRIs. The key contributions in BrLP are fourfold: (i) it operates in a small latent space, mitigating the computational challenges posed by high-dimensional imaging data; (ii) it explicitly integrates subject metadata to enhance the individualization of predictions; (iii) it incorporates prior knowledge of disease dynamics through an auxiliary model, facilitating the integration of longitudinal data; and (iv) it introduces the Latent Average Stabilization (LAS) algorithm, which (a) enforces spatiotemporal consistency in the predicted progression at inference time and (b) allows us to derive a measure of the uncertainty for the prediction. We train and evaluate BrLP on 11,730 T1-weighted (T1w) brain MRIs from 2,805 subjects and validate its generalizability on an external test set comprising 2,257 MRIs from 962 subjects. Our experiments compare BrLP-generated MRI scans with real follow-up MRIs, demonstrating state-of-the-art accuracy compared to existing methods. The code is publicly available at: https://github.com/LemuelPuglisi/BrLP.


Mathematical Data Science

arXiv.org Artificial Intelligence

In this article we discuss an approach to doing this which one can call mathematical data science. In this paradigm, one studies mathematical objects collectively rather than individually, by creating datasets and doing machine learning experiments and interpretations. Broadly speaking, the field of data science is concerned with assembling, curating and analyzing large datasets, and developing methods which enable its users to not just answer predetermined questions about the data but to explore it, make simple descriptions and pictures, and arrive at novel insights. This certainly sounds promising as a tool for mathematical discovery! Mathematical data science is not new and has historically led to very important results. A famous example is the work of Birch and Swinnerton-Dyer leading to their conjecture [BSD65], based on computer generation of elliptic curves and linear regression analysis of the resulting data. However, the field really started to take off with the deep learning revolution and with the easy access to ML models provided by platforms such as PyTorch and TensorFlow, and built into computer algebra systems such as Mathematica, Magma and SageMath.


Knowledge-Guided Wasserstein Distributionally Robust Optimization

arXiv.org Machine Learning

Transfer learning is a popular strategy to leverage external knowledge and improve statistical efficiency, particularly with a limited target sample. We propose a novel knowledge-guided Wasserstein Distributionally Robust Optimization (KG-WDRO) framework that adaptively incorporates multiple sources of external knowledge to overcome the conservativeness of vanilla WDRO, which often results in overly pessimistic shrinkage toward zero. Our method constructs smaller Wasserstein ambiguity sets by controlling the transportation along directions informed by the source knowledge. This strategy can alleviate perturbations on the predictive projection of the covariates and protect against information loss. Theoretically, we establish the equivalence between our WDRO formulation and the knowledge-guided shrinkage estimation based on collinear similarity, ensuring tractability and geometrizing the feasible set. This also reveals a novel and general interpretation for recent shrinkage-based transfer learning approaches from the perspective of distributional robustness. In addition, our framework can adjust for scaling differences in the regression models between the source and target and accommodates general types of regularization such as lasso and ridge. Extensive simulations demonstrate the superior performance and adaptivity of KG-WDRO in enhancing small-sample transfer learning.


Sparse Estimation of Inverse Covariance and Partial Correlation Matrices via Joint Partial Regression

arXiv.org Machine Learning

Two important and closely related problems in statistical learning are the problems of estimating a partial correlation network and the inverse covariance matrix, also known as the precision matrix, from data. Partial correlation networks, which generalize the Gaussian graphical model, are used to model the relationships between variables while conditioning on all other variables, and are useful for inferring causal relationships between variables. Partial correlation networks are used in a plethora of applications, such as in the analysis of gene expression data, where the goal is to infer the regulatory relationships between genes (de la Fuente et al., 2004), and psychological data, where networks are used to model the relationships between psychological variables such as mood and attitude (Epskamp and Fried, 2018). The precision matrix, from which we can obtain the partial correlation network, is also of interest in its own right, as it appears in linear discriminant analysis (Hastie et al., 2009) and in Markowitz portfolio selection (Markowitz, 1952). However, due to the high dimensionality of the problem, estimating a precision or partial correlation matrix is often challenging, as the number of parameters is on the order of the squared number of features. For this reason, classical methods, such as using the inverse of the sample covariance matrix, are known to perform poorly whenever the number of observations is not extremely large. Additionally, they produce estimates which are almost surely dense. This makes regularization crucial, since in many applications we typically only have a moderate number of observations, and in particular we are most often seeking a sparse estimate which gives rise to a more parsimonious and interpretable network model.
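
The statement that the partial correlation network can be obtained from the precision matrix rests on a standard identity: for a precision matrix Ω, the partial correlation between variables i and j given all others is -Ω_ij / sqrt(Ω_ii Ω_jj). The sketch below applies that identity to a small hand-written matrix; the function name `partial_correlations` and the example values are illustrative and not taken from the paper.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of rows) into a partial correlation matrix."""
    p = len(precision)
    result = [[1.0] * p for _ in range(p)]  # diagonal entries are set to 1 by convention
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                # Note the sign flip relative to the precision entry.
                result[i][j] = -precision[i][j] / scale
    return result


if __name__ == "__main__":
    # Tridiagonal precision matrix: variables 0 and 2 are conditionally uncorrelated.
    omega = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    for row in partial_correlations(omega):
        print(row)
```

A zero entry in the precision matrix maps to a zero partial correlation, which is why a sparse precision estimate yields a sparse network.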