 Regression


A successive approximation method in functional spaces for hierarchical optimal control problems and its application to learning

arXiv.org Machine Learning

We consider a class of learning problems of point estimation for modeling high-dimensional nonlinear functions, whose learning dynamics are guided by a model training dataset, while the estimated parameter in due course provides an acceptable prediction accuracy on a different model validation dataset. Here, we establish an evidential connection between such a learning problem and a hierarchical optimal control problem that provides a framework for accounting appropriately for both generalization and regularization at the optimization stage. In particular, we consider the following two objectives: (i) The first one is a controllability-type problem, i.e., generalization, which consists of guaranteeing that the estimated parameter reaches a certain target set at some fixed final time, where such a target set is associated with the model validation dataset. (ii) The second one is a regularization-type problem ensuring that the estimated parameter trajectory satisfies some regularization property over a certain finite time interval. First, we partition the control into two control strategies that are compatible with two abstract agents, namely, a leader, which is responsible for the controllability-type problem, and a follower, which is associated with the regularization-type problem. Using the notion of Stackelberg's optimization, we provide conditions on the existence of admissible optimal controls for such a hierarchical optimal control problem, under which the follower is required to respond optimally to the strategy of the leader, so as to achieve the overall objectives that ultimately lead to an optimal parameter estimate. Moreover, we provide a nested algorithm, arranged in a hierarchical structure based on successive approximation methods, for solving the corresponding optimal control problem. Finally, we present some numerical results for a typical nonlinear regression problem.


On the Gaussian process limit of Bayesian Additive Regression Trees

arXiv.org Machine Learning

Bayesian Additive Regression Trees (BART) is a nonparametric Bayesian regression technique of rising fame. It is a sum-of-decision-trees model, and is in some sense the Bayesian version of boosting. In the limit of infinite trees, it becomes equivalent to Gaussian process (GP) regression. This limit is known but has not yet led to any useful analysis or application. For the first time, I derive and compute the exact BART prior covariance function. With it I implement the infinite trees limit of BART as GP regression. Through empirical tests, I show that this limit is worse than standard BART in a fixed configuration, but also that tuning the hyperparameters in the natural GP way yields a competitive method, although a properly tuned BART is still superior. The advantage of using a GP surrogate of BART is the analytical likelihood, which simplifies model building and sidesteps the complex BART MCMC. More generally, this study opens new ways to understand and develop BART and GP regression. The implementation of BART as GP is available in the Python package https://github.com/Gattocrucco/lsqfitgp .


Classification under strategic adversary manipulation using pessimistic bilevel optimisation

arXiv.org Artificial Intelligence

Adversarial machine learning concerns situations in which learners face attacks from active adversaries. Such scenarios arise in applications such as spam email filtering, malware detection and fake-image generation, where security methods must be actively updated to keep up with the ever-improving generation of malicious data. We model these interactions between the learner and the adversary as a game and formulate the problem as a pessimistic bilevel optimisation problem with the learner taking the role of the leader. The adversary, modelled as a stochastic data generator, takes the role of the follower, generating data in response to the classifier. While existing models rely on the assumption that the adversary will choose the least costly solution, leading to a convex lower-level problem with a unique solution, we present a novel model and solution method which do not make such assumptions. We compare these to the existing approach and see significant improvements in performance, suggesting that relaxing these assumptions leads to a more realistic model.


Predicting Mortality and Functional Status Scores of Traumatic Brain Injury Patients using Supervised Machine Learning

arXiv.org Artificial Intelligence

Traumatic brain injury (TBI) presents a significant public health challenge, often resulting in mortality or lasting disability. Predicting outcomes such as mortality and Functional Status Scale (FSS) scores can enhance treatment strategies and inform clinical decision-making. This study applies supervised machine learning (ML) methods to predict mortality and FSS scores using a real-world dataset of 300 pediatric TBI patients from the University of Colorado School of Medicine. The dataset captures clinical features, including demographics, injury mechanisms, and hospitalization outcomes. Eighteen ML models were evaluated for mortality prediction, and thirteen models were assessed for FSS score prediction. Performance was measured using accuracy, ROC AUC, F1-score, and mean squared error. Logistic regression and Extra Trees models achieved high precision in mortality prediction, while linear regression demonstrated the best FSS score prediction. Feature selection reduced 103 clinical variables to the most relevant, enhancing model efficiency and interpretability. This research highlights the role of ML models in identifying high-risk patients and supporting personalized interventions, demonstrating the potential of data-driven analytics to improve TBI care and integrate into clinical workflows.


Statistical Inference in Classification of High-dimensional Gaussian Mixture

arXiv.org Machine Learning

We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha=n/p$ remains fixed. Our focus is on the generalization error and variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conducted extensive computational experiments to confirm that our analytical findings are consistent with numerical simulations in finite-sized systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.
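The setting in this abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes identity covariance (the paper allows general covariance matrices), a hypothetical sparse mean vector `mu`, and a plain proximal-gradient (ISTA) solver for $L_1$-regularized logistic regression. The de-biased estimator and the hypothesis test are not reproduced here; all function names, the seed, and the values of `p`, `alpha`, and `lam` are illustrative assumptions.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def simulate_mixture(n, mu, rng):
    """Draw n points x = y * mu + standard Gaussian noise, with y in {-1, +1}."""
    features, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else -1
        features.append([label * m + rng.gauss(0.0, 1.0) for m in mu])
        labels.append(label)
    return features, labels


def fit_l1_logistic(features, labels, lam, step=0.1, iterations=300):
    """Minimize mean logistic loss + lam * ||w||_1 by proximal gradient descent."""
    n, p = len(features), len(features[0])
    w = [0.0] * p
    for _ in range(iterations):
        grad = [0.0] * p
        for x, y in zip(features, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # d/dw log(1 + exp(-margin)) = -y * x / (1 + exp(margin));
            # the margin is clipped only to avoid overflow in exp.
            coef = -y / (1.0 + math.exp(min(margin, 50.0)))
            for j in range(p):
                grad[j] += coef * x[j] / n
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


def error_rate(w, features, labels):
    """Fraction of points whose predicted sign disagrees with the label."""
    wrong = 0
    for x, y in zip(features, labels):
        if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
            wrong += 1
    return wrong / len(labels)


rng = random.Random(0)          # fixed seed, illustrative only
p, alpha = 20, 5                # dimension and ratio alpha = n / p
n = alpha * p
mu = [1.0] * 3 + [0.0] * (p - 3)  # hypothetical sparse mean: 3 informative coordinates

train_x, train_y = simulate_mixture(n, mu, rng)
test_x, test_y = simulate_mixture(500, mu, rng)

w_hat = fit_l1_logistic(train_x, train_y, lam=0.1)
selected = [j for j, wj in enumerate(w_hat) if wj != 0.0]
test_error = error_rate(w_hat, test_x, test_y)

print("nonzero coordinates:", selected)
print("test error:", test_error)
```

Varying `alpha` while holding `p` fixed gives a rough finite-size view of how the generalization error depends on the ratio $n/p$; the exact soft-thresholding step is what produces coefficients that are exactly zero.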


Adjusted Overfitting Regression

arXiv.org Artificial Intelligence

Abstract: In this paper, I introduce a new form of regression that can adjust overfitting and underfitting through "distance-based regression". Overfitting often results in finding false patterns, causing inaccurate results, so a new approach that minimizes overfitting can yield more accurate predictions. I then test this regression form and show additional ways to optimize the regression. Finally, I apply the new technique to a specific data set to demonstrate its practical value.
Contents: Introduction; 1. Distance and X-axis Based Regression (1.1 X-Axis Based Regression; 1.2 Distance Based Regression); 2. Weighted Regression (2.1 Division "Weighted Cost Functions"; 2.2 Other "Weighted Cost Functions"; 2.3 Randomness and change adjusted "Weighted Cost Functions"); 3. Applications and Tests (3.1 Testing on Different Data sets); References; Index.
Introduction: In this paper I introduce a new form of regression, "Overfitting Based Regression", which allows you to tune the level of overfitting or underfitting, with the goal of generalizing standard regression methods. This new regression technique produces a nonlinear function of the x, or right-hand-side, variables using weights on neighboring data points, instead of the traditional approach of fitting a best-fit line.
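The excerpt does not give the paper's weighting formula, so the sketch below uses a stand-in: a Gaussian-kernel weighted average of neighboring points (Nadaraya-Watson style), which is a different, well-known technique and not the author's method. It only illustrates the general idea of predicting from distance-based weights, with one knob that moves the fit between overfitting and underfitting. The function name, `bandwidth` parameter, and toy data are assumptions.

```python
import math


def weighted_prediction(x_query, xs, ys, bandwidth):
    """Predict y at x_query as a distance-weighted average of observed ys.

    Small bandwidth: the nearest points dominate (tends toward overfitting).
    Large bandwidth: all points count almost equally (tends toward underfitting).
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [math.exp(-((x_query - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed: fall back to the nearest observation.
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_query))
        return ys[nearest]
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy data on a parabola, illustrative only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

tight = weighted_prediction(2.0, xs, ys, bandwidth=0.1)    # follows the local point
loose = weighted_prediction(2.0, xs, ys, bandwidth=100.0)  # close to the global mean

print("bandwidth 0.1:", tight)
print("bandwidth 100:", loose)
```

With the tight bandwidth the prediction reproduces the observed value at the query point, while the loose bandwidth collapses toward the overall mean of `ys`; intermediate values trade off between the two.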


Distance and Kernel-Based Measures for Global and Local Two-Sample Conditional Distribution Testing

arXiv.org Machine Learning

Testing the equality of two conditional distributions is crucial in various modern applications, including transfer learning and causal inference. Despite its importance, this fundamental problem has received surprisingly little attention in the literature. This work aims to present a unified framework based on distance and kernel methods for both global and local two-sample conditional distribution testing. To this end, we introduce distance and kernel-based measures that characterize the homogeneity of two conditional distributions. Drawing from the concept of conditional U-statistics, we propose consistent estimators for these measures. Theoretically, we derive the convergence rates and the asymptotic distributions of the estimators under both the null and alternative hypotheses. Utilizing these measures, along with a local bootstrap approach, we develop global and local tests that can detect discrepancies between two conditional distributions at global and local levels, respectively. Our tests demonstrate reliable performance through simulations and real data analyses.


Retrieving snow depth distribution by downscaling ERA5 Reanalysis with ICESat-2 laser altimetry

arXiv.org Artificial Intelligence

Estimating the variability of seasonal snow cover, in particular snow depth in remote areas, poses significant challenges due to limited spatial and temporal data availability. This study uses snow depth measurements from the ICESat-2 satellite laser altimeter, which are sparse in both space and time, and incorporates them with climate reanalysis data into a downscaling-calibration scheme to produce monthly gridded snow depth maps at microscale (10 m). Snow surface elevation measurements from ICESat-2 along profiles are compared to a digital elevation model to determine snow depth at each point. To efficiently turn sparse measurements into snow depth maps, a regression model is fitted to establish a relationship between the retrieved snow depth and the corresponding ERA5 Land snow depth. This relationship, referred to as subgrid variability, is then applied to downscale the monthly ERA5 Land snow depth data. The method can provide time series of monthly snow depth maps for the entire ERA5 time range (since 1950). The validation of downscaled snow depth data was performed at an intermediate scale (100 m x 500 m) using datasets from airborne laser scanning (ALS) in the Hardangervidda region of southern Norway. Results show that snow depth prediction achieved $R^2$ values ranging from 0.74 to 0.88 (post-calibration). The method relies on globally available data and is applicable to other snow regions above the treeline. Though requiring area-specific calibration, our approach has the potential to provide snow depth maps in areas where no such data exist and can be used to extrapolate existing snow surveys in time and over larger areas. With this, it can offer valuable input data for hydrological, ecological or permafrost modeling tasks.


Robust Variable Selection for High-dimensional Regression with Missing Data and Measurement Errors

arXiv.org Machine Learning

The linear relationship between response variables and covariates has long been a topic of interest. With the classical squared loss function, it is usually assumed that the data obey a normal distribution. However, the data discussed in this paper contain a large amount of missing data and measurement errors, such that the data usually do not conform to any of the common distributional forms. We propose a method based on an exponential squared loss function with a tuning parameter. For data with different distributions, a better linear regression result can be achieved by changing the value of the tuning parameter $h$. Therefore, for any kind of data distribution, an exponential squared loss function with moderating variables will be highly robust. For any data distribution, the loss function is strongly robust for $h \in (0, +\infty)$. In previous studies, when using the traditional squared loss function, the data distribution requirements are very high, resulting in the traditional squared loss function being very sensitive to anomalies. This reduces the estimation efficiency of the model, and this drawback becomes more obvious in data containing missing data with measurement errors. In contrast, the use of exponential squared loss functions can improve the estimation efficiency of the model by varying the tuning parameter $h$ in a way that adapts to more distributional forms of data sets and produces more reliable estimates.
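The robustness claim can be made concrete with a small numerical sketch. The excerpt names the exponential squared loss but does not state its formula; the form $1 - \exp(-r^2/h)$ used below is an assumption (a commonly used form), and the grid-search location estimator and sample values are purely illustrative, not the paper's estimation procedure.

```python
import math


def exp_squared_loss(residual, h):
    """Assumed form of the exponential squared loss: 1 - exp(-r^2 / h), h > 0.

    The loss is bounded above by 1, so a single large residual cannot dominate.
    """
    if h <= 0:
        raise ValueError("h must be positive")
    return 1.0 - math.exp(-(residual ** 2) / h)


def robust_location(values, h, grid_step=0.01):
    """Grid-search the location m that minimizes the summed exponential squared loss."""
    lo, hi = min(values), max(values)
    steps = int(round((hi - lo) / grid_step))
    candidates = [lo + i * grid_step for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda m: sum(exp_squared_loss(v - m, h) for v in values),
    )


# Five points near 1.0 plus one gross outlier (illustrative data).
samples = [1.0, 1.1, 0.9, 1.05, 0.95, 50.0]

# The squared loss is minimized by the mean, which the outlier drags away.
sample_mean = sum(samples) / len(samples)

# The bounded loss keeps the estimate near the bulk of the data.
robust_estimate = robust_location(samples, h=1.0)

print("squared-loss estimate (mean):", sample_mean)
print("exponential squared-loss estimate:", robust_estimate)
```

Under this assumed form, small $h$ caps the influence of large residuals sharply, while for large $h$ the loss behaves like $r^2/h$ and approaches ordinary least squares up to scaling, which is one way to read the role of the tuning parameter described above.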
With the traditional squared loss function, the values of the covariates are by default assumed to be free of missing data and measurement errors. Even if missing data and measurement errors exist, they are assumed to be absent or these data are removed. However, this assumption is often violated in studies in disciplines such as health and epidemiology. As an illustration, Zhang and Zhou (1) looked at a collection of breast cancer patients to identify the gene expression that was associated with long-term disease-free survival. The data collection consists of 24481 gene probes collected from 78 breast cancer patients. In particular, using the log-value of the ratio, $\log_{10}(\text{Ratio})$, which could be denoted as $Y$, it is possible to forecast the disease-free survival. In truth, gene sensors will inevitably lead to measurement errors. In this breast cancer data set, the $\log_{10}(\text{Ratio})$ numbers have missing data. When there are a large number of missing data and measurement errors in a dataset, if we ignore the missing data and measurement errors and use the traditional squared loss function for estimation, the estimation accuracy of the model will be greatly affected due to the chaotic data distribution, resulting in significant estimation bias. In the above dataset, we discover that employing the traditional squared loss function, which handles data with measurement errors and ...


Self-Supervised Learning for Time Series: A Review & Critique of FITS

arXiv.org Artificial Intelligence

Accurate time series forecasting is a highly valuable endeavour with applications across many industries. Despite recent deep learning advancements, increased model complexity, and larger model sizes, many state-of-the-art models often perform worse than or on par with simpler models. One of those cases is a recently proposed model, FITS, claiming competitive performance with significantly reduced parameter counts. By training a one-layer neural network in the complex frequency domain, we are able to replicate these results. Our experiments on a wide range of real-world datasets further reveal that FITS especially excels at capturing periodic and seasonal patterns, but struggles with trending, non-periodic, or random-resembling behavior. With our two novel hybrid approaches, where we attempt to remedy the weaknesses of FITS by combining it with DLinear, we achieve the best results of any known open-source model on multivariate regression and promising results in multiple/linear regression on price datasets, on top of vastly improving upon what FITS achieves as a standalone model.