Regression
Towards a Foundation Model for Brain Age Prediction using coVariance Neural Networks
Sihag, Saurabh, Mateos, Gonzalo, Ribeiro, Alejandro
Brain age is the estimate of biological age derived from neuroimaging datasets using machine learning algorithms. Increasing brain age with respect to chronological age can reflect increased vulnerability to neurodegeneration and cognitive decline. In this paper, we study NeuroVNN, based on coVariance neural networks, as a paradigm for a foundation model for brain age prediction. NeuroVNN is pre-trained as a regression model on a healthy population to predict chronological age using cortical thickness features and fine-tuned to estimate brain age in different neurological contexts. Importantly, NeuroVNN adds anatomical interpretability to brain age and has a "scale-free" characteristic that allows it to be transferred to datasets curated according to any arbitrary brain atlas. Our results demonstrate that NeuroVNN can extract biologically plausible brain age estimates in different populations, and that it transfers successfully to datasets whose dimensionality differs from that of the dataset used to train NeuroVNN.
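The "scale-free" property can be illustrated with a small sketch. It assumes that a coVariance filter is a polynomial in the sample covariance matrix $C$, i.e. $H(C)x = \sum_k h_k C^k x$; the abstract does not state this form, and the function names below are hypothetical. Under that assumption the learnable coefficients $h_k$ do not depend on the number of features, so the same coefficients can be applied to a covariance matrix of any size.

```python
def matvec(C, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(c * v for c, v in zip(row, x)) for row in C]


def covariance_filter(C, x, taps):
    """Apply H(C) x = sum_k taps[k] * C^k x (assumed polynomial filter form).

    `taps` holds the filter coefficients h_0, h_1, ...; its length is
    independent of the dimension of C and x.
    """
    out = [0.0] * len(x)
    power = list(x)  # holds C^k x, starting with k = 0
    for h in taps:
        out = [o + h * p for o, p in zip(out, power)]
        power = matvec(C, power)
    return out


# The same two coefficients applied to a 2x2 and a 3x3 covariance matrix.
taps = [1.0, 0.5]
small = covariance_filter([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0], taps)
large = covariance_filter(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [1.0, 2.0, 3.0], taps
)
```

This is a single filter only, not the NeuroVNN architecture, which the abstract does not describe in detail.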
Clustering Dynamics for Improved Speed Prediction Deriving from Topographical GPS Registrations
Carneiro, Sarah Almeida, Chierchia, Giovanni, Pirayre, Aurelie, Najman, Laurent
A persistent challenge in the field of Intelligent Transportation Systems is to extract accurate traffic insights from geographic regions with scarce or no data coverage. To this end, we propose solutions for speed prediction using sparse GPS data points and their associated topographical and road design features. Our goal is to investigate whether we can use similarities in the terrain and infrastructure to train a machine learning model that can predict speed in regions where we lack transportation data. For this we create a Temporally Orientated Speed Dictionary Centered on Topographically Clustered Roads, which helps us to provide speed correlations to selected feature configurations. Our results show qualitative and quantitative improvement over new and standard regression methods. The presented framework provides a fresh perspective on devising strategies for missing data traffic analysis.
The Limits of Assumption-free Tests for Algorithm Performance
Luo, Yuetian, Barber, Rina Foygel
Algorithm evaluation and comparison are fundamental questions in machine learning and statistics -- how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based around cross-validation type strategies, retraining the algorithm of interest on different subsets of the data and assessing its performance on the held-out data points. Despite the broad use of such procedures, the theoretical properties of these methods are not yet fully understood. In this work, we explore some fundamental limits for answering these questions with limited amounts of data. In particular, we make a distinction between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus, how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm $A$ as a "black box" (i.e., we can only study the behavior of $A$ empirically), there is a fundamental limit on our ability to carry out inference on the performance of $A$, unless the number of available data points $N$ is many times larger than the sample size $n$ of interest. (On the other hand, evaluating the performance of a particular fitted model is easy as long as a holdout data set is available -- that is, as long as $N-n$ is not too small.) We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that this is not the case: the same hardness result still holds for the problem of evaluating the performance of $A$, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we also establish similar hardness results for the problem of comparing multiple algorithms.
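The distinction between the two questions can be made concrete with a toy sketch. It is not the paper's test: the "algorithm" is a hypothetical mean predictor, the loss is squared error, and all function names are illustrative. Evaluating one fitted model needs a single holdout set of size $N-n$, whereas each independent observation of the algorithm's risk at sample size $n$ consumes a fresh block of $n$ training points, which gives some intuition for why $N$ must be many times larger than $n$.

```python
def fit_mean(train):
    """Toy black-box algorithm A: always predict the training mean."""
    mu = sum(train) / len(train)
    return lambda: mu


def holdout_risk(model, holdout):
    """Mean squared error of a fitted model on held-out responses."""
    pred = model()
    return sum((y - pred) ** 2 for y in holdout) / len(holdout)


def fitted_model_risk(data, n):
    """Question 2: risk of the one model fitted on the first n points."""
    return holdout_risk(fit_mean(data[:n]), data[n:])


def algorithm_risks(data, n, n_eval):
    """Question 1: one risk value per disjoint block of n + n_eval points.

    Each block retrains A on n fresh points, so only
    len(data) // (n + n_eval) independent replicates are available.
    """
    block = n + n_eval
    risks = []
    for start in range(0, len(data) - block + 1, block):
        chunk = data[start:start + block]
        risks.append(holdout_risk(fit_mean(chunk[:n]), chunk[n:]))
    return risks
```

With $N = 100$, $n = 20$ and 5 evaluation points per block, `algorithm_risks` yields only four replicates, while `fitted_model_risk` uses all 80 remaining points to evaluate its single model.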
Resampling methods for Private Statistical Inference
Chadha, Karan, Duchi, John, Kuditipudi, Rohit
Releasing statistics computed from sensitive data can compromise the privacy of the individuals contributing the data (Narayanan and Shmatikov, 2008; Dick et al., 2023). Differential privacy (Dwork et al., 2006) is now a widely accepted solution for performing statistical analysis while protecting sensitive data. In the years since its introduction, researchers have made considerable progress in developing differentially private estimators for a range of statistical problems such as mean estimation, median estimation, and logistic regression (Asi and Duchi, 2020; Chaudhuri et al., 2011). However, deriving a conclusion from a single point estimate -- whether an empirical mean or a classifier prediction -- without any consideration of uncertainty can lead to faulty, inaccurate decision-making (Gelman and Loken, 2013). To have any hope of making private statistical tools broadly applicable, we must build the requisite inferential tools. Constructing confidence intervals around a given point estimate is the most basic inferential task. We therefore develop tools to do so, with differential privacy, for a broad class of statistics of interest.
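For context, the kind of private point estimate that such intervals would accompany can be sketched with the standard Laplace mechanism for a clipped mean. This is not the resampling method of the paper; the function names and the clipping bounds are assumptions made for illustration. The half-width returned below accounts only for the injected privacy noise, not for sampling variability, which is part of what makes private inference nontrivial.

```python
import math
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: the difference of two Exp(1) draws is Laplace(0, 1)."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_mean(values, lower, upper, epsilon, rng):
    """epsilon-DP mean of values clipped to [lower, upper] (Laplace mechanism).

    Replacing one record changes the clipped mean by at most
    (upper - lower) / n, which is the sensitivity used for the noise scale.
    """
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    mean = sum(clipped) / len(clipped)
    return mean + laplace_noise(sensitivity / epsilon, rng)


def noise_half_width(scale, alpha):
    """Half-width t with P(|Laplace(0, scale)| <= t) = 1 - alpha."""
    return scale * math.log(1.0 / alpha)


rng = random.Random(0)
estimate = private_mean([0.2, 0.4, 0.9, 1.7], lower=0.0, upper=1.0, epsilon=1.0, rng=rng)
```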
Optimizing Uterine Synchronization Analysis in Pregnancy and Labor through Window Selection and Node Optimization
Dine, Kamil Bader El, Nader, Noujoud, Khalil, Mohamad, Marque, Catherine
Preterm labor (PL) has globally become the leading cause of death in children under the age of 5 years. To address this problem, this paper proposes a new approach based on analyzing EHG signals, which are recorded on the abdomen of the mother during pregnancy and labor. The EHG signal reflects the electrical activity that induces the mechanical contraction of the myometrium. Because EHGs are known to be non-stationary signals, and because we anticipate that connectivity changes during contraction, we applied a windowing approach to real signals to identify the best windows and the best nodes, i.e. those carrying the most significant data for classification. The suggested pipeline includes i) dividing the 16 EHG signals recorded from the abdomen of pregnant women into N windows; ii) computing the connectivity matrices on each window; iii) computing graph theory-based measures from the connectivity matrices of each window; iv) applying the consensus matrix on each window in order to retrieve the best windows and the best nodes. Following that, several neural network and machine learning methods are applied to the best windows and best nodes to classify pregnancy and labor contractions, based on the different input parameters (connectivity method alone, connectivity method plus graph parameters, best nodes, all nodes, best windows, all windows). Results showed that the best nodes are nodes 8, 9, 10, 11, and 12, while the best windows are 2, 4, and 5. The classification results obtained using only these best nodes are better than those obtained using all nodes. The results are always better when using the full burst, whatever the chosen nodes. Thus, the windowing approach proved to be an innovative technique that can improve the differentiation between labor and pregnancy EHG signals.
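Steps i) to iii) of the pipeline can be sketched as follows. The abstract does not name the connectivity method or the graph measures, so this sketch assumes absolute Pearson correlation as the connectivity measure and node strength (row sum) as the graph measure; all function names are hypothetical. Step iv), the consensus matrix, is not specified in the abstract and is left out.

```python
import math


def split_windows(signal, n_windows):
    """Step i: cut one channel into n_windows equal, non-overlapping windows."""
    size = len(signal) // n_windows
    return [signal[i * size:(i + 1) * size] for i in range(n_windows)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def connectivity_matrix(windows):
    """Step ii: channel-by-channel |correlation| matrix with a zero diagonal."""
    k = len(windows)
    return [
        [abs(pearson(windows[i], windows[j])) if i != j else 0.0 for j in range(k)]
        for i in range(k)
    ]


def node_strength(matrix):
    """Step iii: one simple graph measure, the sum of each node's edge weights."""
    return [sum(row) for row in matrix]


def strengths_per_window(channels, n_windows):
    """Run steps i to iii; returns one list of node strengths per window."""
    windowed = [split_windows(ch, n_windows) for ch in channels]
    result = []
    for w in range(n_windows):
        conn = connectivity_matrix([windowed[c][w] for c in range(len(channels))])
        result.append(node_strength(conn))
    return result
```

For the recordings described above, `channels` would hold the 16 EHG signals, and the per-window node strengths would feed the node and window selection step.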
Logistic-beta processes for modeling dependent random probabilities with beta marginals
Lee, Changwoo J., Zito, Alessandro, Sang, Huiyan, Dunson, David B.
The beta distribution serves as a canonical tool for modeling probabilities and is extensively used in statistics and machine learning, especially in the field of Bayesian nonparametrics. Despite its widespread use, there is limited work on flexible and computationally convenient stochastic process extensions for modeling dependent random probabilities. We propose a novel stochastic process called the logistic-beta process, whose logistic transformation yields a stochastic process with common beta marginals. Similar to the Gaussian process, the logistic-beta process can model dependence on both discrete and continuous domains, such as space or time, and has a highly flexible dependence structure through correlation kernels. Moreover, its normal variance-mean mixture representation leads to highly effective posterior inference algorithms. The flexibility and computational benefits of logistic-beta processes are demonstrated through nonparametric binary regression simulation studies. Furthermore, we apply the logistic-beta process in modeling dependent Dirichlet processes, and illustrate its application and benefits through Bayesian density regression problems in a toxicology study.
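The statement about beta marginals can be checked at the level of a single coordinate with a change of variables. The notation below ($p$, $x$, $\sigma$, $a$, $b$) is introduced here for illustration and may differ from the paper's.

```latex
% Let p ~ Beta(a, b) and x = log{p / (1 - p)}, so that
% p = \sigma(x) = 1 / (1 + e^{-x}) and dp/dx = \sigma(x)\{1 - \sigma(x)\}.
\begin{align*}
f_X(x)
  &= \frac{\sigma(x)^{a-1}\{1-\sigma(x)\}^{b-1}}{B(a,b)}
     \cdot \sigma(x)\{1-\sigma(x)\} \\
  &= \frac{1}{B(a,b)}\,\sigma(x)^{a}\{1-\sigma(x)\}^{b},
  \qquad x \in \mathbb{R}.
\end{align*}
% Conversely, if x has this density, then \sigma(x) ~ Beta(a, b).
```

A real-valued variable with this density is therefore mapped by the logistic function to a $\mathrm{Beta}(a,b)$ variable, which is the marginal behaviour the abstract describes.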
Contextual Stochastic Vehicle Routing with Time Windows
Serrano, Breno, Florio, Alexandre M., Minner, Stefan, Schiffer, Maximilian, Vidal, Thibaut
We study the vehicle routing problem with time windows (VRPTW) and stochastic travel times, in which the decision-maker observes related contextual information, represented as feature variables, before making routing decisions. Despite the extensive literature on stochastic VRPs, the integration of feature variables has received limited attention in this context. We introduce the contextual stochastic VRPTW, which minimizes the total transportation cost and expected late arrival penalties conditioned on the observed features. Since the joint distribution of travel times and features is unknown, we present novel data-driven prescriptive models that use historical data to provide an approximate solution to the problem. We distinguish the prescriptive models between point-based approximation, sample average approximation, and penalty-based approximation, each taking a different perspective on dealing with stochastic travel times and features. We develop specialized branch-price-and-cut algorithms to solve these data-driven prescriptive models. In our computational experiments, we compare the out-of-sample cost performance of different methods on instances with up to one hundred customers. Our results show that, surprisingly, a feature-dependent sample average approximation outperforms existing and novel methods in most settings.
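A feature-dependent sample average approximation can be sketched for a single fixed route. The abstract does not say how the historical samples are conditioned on features, so this sketch assumes a k-nearest-neighbour selection in feature space with equal weights; it also ignores waiting at early arrivals and the routing optimization itself. All names are hypothetical.

```python
def route_lateness(travel_times, deadlines):
    """Total lateness along a fixed route for one travel-time scenario.

    travel_times[i] is the time of leg i; deadlines[i] is the end of the
    time window at the i-th stop. Waiting and service times are not modelled.
    """
    clock, late = 0.0, 0.0
    for leg, due in zip(travel_times, deadlines):
        clock += leg
        late += max(0.0, clock - due)
    return late


def knn_saa_penalty(history, features, deadlines, k):
    """Average lateness over the k historical scenarios closest in features.

    history is a list of (feature_vector, travel_times) pairs.
    """
    def sq_dist(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], features))

    nearest = sorted(history, key=sq_dist)[:k]
    return sum(route_lateness(tt, deadlines) for _, tt in nearest) / len(nearest)


history = [
    ([0.0], [1.0, 1.0]),    # e.g. a light-traffic observation
    ([0.1], [2.0, 2.0]),
    ([5.0], [10.0, 10.0]),  # e.g. a heavy-traffic observation, far in feature space
]
penalty = knn_saa_penalty(history, features=[0.0], deadlines=[1.5, 3.0], k=2)
```

Here the heavy-traffic observation is excluded because its features are far from the observed ones, so the estimate averages only the two nearby scenarios.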
Clustering Techniques Selection for a Hybrid Regression Model: A Case Study Based on a Solar Thermal System
García-Ordás, María Teresa, Alaiz-Moretón, Héctor, Casteleiro-Roca, José-Luis, Jove, Esteban, Benítez-Andrades, José Alberto, García-Rodríguez, Isaías, Quintián, Héctor, Calvo-Rolle, José Luis
This work compares the performance of four clustering techniques with the objective of achieving strong hybrid models in supervised learning tasks. A real dataset has been collected from a bio-climatic house named Sotavento, placed on an experimental wind farm located in Xermade (Lugo), Galicia (Spain). The authors have chosen the thermal solar generation system in order to study how well several clustering methods, each followed by a regression technique, predict the output temperature of the system. To assess the quality of each clustering method, two approaches have been implemented. The first is based on three unsupervised learning metrics (Silhouette, Calinski-Harabasz and Davies-Bouldin), while the second employs the most common error measurements for a regression algorithm such as the Multi Layer Perceptron.
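The hybrid idea of clustering first and then fitting one regressor per cluster can be sketched in a few lines. To stay short, this sketch makes several substitutions that are not in the abstract: inputs are one-dimensional, cluster centroids are supplied rather than learned, and each local regressor is an ordinary least-squares line instead of a Multi Layer Perceptron. All names are hypothetical.

```python
def nearest(centroids, x):
    """Index of the centroid closest to x (the cluster assignment)."""
    return min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))


def fit_line(xs, ys):
    """Least-squares slope and intercept for one cluster's points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx


def fit_hybrid(xs, ys, centroids):
    """Fit one local regressor per non-empty cluster."""
    groups = {c: ([], []) for c in range(len(centroids))}
    for x, y in zip(xs, ys):
        gx, gy = groups[nearest(centroids, x)]
        gx.append(x)
        gy.append(y)
    return {c: fit_line(gx, gy) for c, (gx, gy) in groups.items() if gx}


def predict_hybrid(models, centroids, x):
    """Route x to its cluster's regressor (raises KeyError for an empty cluster)."""
    slope, intercept = models[nearest(centroids, x)]
    return slope * x + intercept
```

The second evaluation approach in the abstract corresponds to comparing the regression error of such hybrid models across clustering methods, rather than scoring the clusters themselves.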
Low-Rank Approximation of Structural Redundancy for Self-Supervised Learning
We study the data-generating mechanism for reconstructive SSL to shed light on its effectiveness. With an infinite amount of labeled samples, we provide a sufficient and necessary condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of Y, along with a redundant component. Motivated by the condition, we propose to approximate the redundant component by a low-rank factorization and measure the approximation quality by introducing a new quantity $\epsilon_s$, parameterized by the rank of factorization s. We incorporate $\epsilon_s$ into the excess risk analysis under both linear regression and ridge regression settings, where the latter regularization approach is to handle scenarios when the dimension of the learned features is much larger than the number of labeled samples n for downstream tasks. We design three stylized experiments to compare SSL with supervised learning under different settings to support our theoretical findings.
Scalable Kernel Logistic Regression with Nyström Approximation: Theoretical Analysis and Application to Discrete Choice Modelling
Martín-Baos, José Ángel, García-Ródenas, Ricardo, Rodriguez-Benitez, Luis, Bierlaire, Michel
The application of kernel-based Machine Learning (ML) techniques to discrete choice modelling using large datasets often faces challenges due to memory requirements and the considerable number of parameters involved in these models. This complexity hampers the efficient training of large-scale models. This paper addresses these problems of scalability by introducing the Nyström approximation for Kernel Logistic Regression (KLR) on large datasets. The study begins by presenting a theoretical analysis in which: i) the set of KLR solutions is characterised, ii) an upper bound to the solution of KLR with Nyström approximation is provided, and finally iii) a specialisation of the optimisation algorithms to Nyström KLR is described. After this, the Nyström KLR is computationally validated. Four landmark selection methods are tested, including basic uniform sampling, a k-means sampling strategy, and two non-uniform methods grounded in leverage scores. The performance of these strategies is evaluated using large-scale transport mode choice datasets and is compared with traditional methods such as Multinomial Logit (MNL) and contemporary ML techniques. The study also assesses the efficiency of various optimisation techniques for the proposed Nyström KLR model. The performance of gradient descent, Momentum, Adam, and L-BFGS-B optimisation methods is examined on these datasets. Among these strategies, the k-means Nyström KLR approach emerges as a successful solution for applying KLR to large datasets, particularly when combined with the L-BFGS-B and Adam optimisation methods. The results highlight the ability of this strategy to handle datasets exceeding 200,000 observations while maintaining robust performance.
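For reference, the standard form of the Nyström approximation behind the memory savings is given below. The symbols ($K$, $n$, $m$, $z_j$) are introduced here and may differ from the paper's notation; the landmark points $z_1, \dots, z_m$ are what the four landmark selection methods choose.

```latex
% K is the n x n kernel matrix, K_{ij} = k(x_i, x_j).
% Choose m << n landmark points z_1, ..., z_m and define
%   (K_{nm})_{ij} = k(x_i, z_j),   (K_{mm})_{ij} = k(z_i, z_j).
\[
  K \;\approx\; \tilde{K} \;=\; K_{nm}\, K_{mm}^{\dagger}\, K_{nm}^{\top},
\]
% where K_{mm}^{\dagger} is the Moore-Penrose pseudo-inverse. Only K_{nm}
% and K_{mm} need to be stored, i.e. O(nm + m^2) entries instead of O(n^2).
```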