Goto

Collaborating Authors

 heteroscedasticity




Supplementary Material of Absolute Neighbour Difference based Correlation T est for Detecting Heteroscedastic Relationships

Neural Information Processing Systems

According to the Cauchy Schwarz inequality, it should also have a value between 1. 2 Second, consider the numerator of (7). For M > 2, it can be proved in the same way as above. White test was set to be 15, otherwise it may fail to detect the heteroscedasticity of the residuals. This is because when 99% or even 99.9% of the variance of Four existing association measures were also implemented for make comparisons with the proposed method. These approaches are typical and well-established.



Conformalized Regression for Continuous Bounded Outcomes

Wu, Zhanli, Leisen, Fabrizio, Rubio, F. Javier

arXiv.org Machine Learning

Regression problems with bounded continuous outcomes frequently arise in real-world statistical and machine learning applications, such as the analysis of rates and proportions. A central challenge in this setting is predicting a response associated with a new covariate value. Most of the existing statistical and machine learning literature has focused either on point prediction of bounded outcomes or on interval prediction based on asymptotic approximations. We develop conformal prediction intervals for bounded outcomes based on transformation models and beta regression. We introduce tailored non-conformity measures based on residuals that are aligned with the underlying models, and account for the inherent heteroscedasticity in regression settings with bounded outcomes. We present a theoretical result on asymptotic marginal and conditional validity in the context of full conformal prediction, which remains valid under model misspecification. For split conformal prediction, we provide an empirical coverage analysis based on a comprehensive simulation study. The simulation study demonstrates that both methods provide valid finite-sample predictive coverage, including settings with model misspecification. Finally, we demonstrate the practical performance of the proposed conformal prediction intervals on real data and compare them with bootstrap-based alternatives.


ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data

Cavazos, Javier Salazar, Fessler, Jeffrey A., Balzano, Laura

arXiv.org Machine Learning

Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample. Heteroscedastic methods aim to deal with such mixed data quality. This paper develops a subspace learning method, named ALPCAH, that can estimate the sample-wise noise variances and use this information to improve the estimate of the subspace basis associated with the low-rank structure of the data. Our method makes no distributional assumptions of the low-rank component and does not assume that the noise variances are known. Further, this method uses a soft rank constraint that does not require subspace dimension to be known. Additionally, this paper develops a matrix factorized version of ALPCAH, named LR-ALPCAH, that is much faster and more memory efficient at the cost of requiring subspace dimension to be known or estimated. Simulations and real data experiments show the effectiveness of accounting for data heteroscedasticity compared to existing algorithms. Code available at https://github.com/javiersc1/ALPCAH.


H-AddiVortes: Heteroscedastic (Bayesian) Additive Voronoi Tessellations

Stone, Adam J., Gosling, John Paul

arXiv.org Machine Learning

This paper introduces the Heteroscedastic AddiVortes model, a Bayesian non-parametric regression framework that simultaneously models the conditional mean and variance of a response variable using adaptive Voronoi tessellations. By employing a sum-of-tessellations approach for the mean and a product-of-tessellations approach for the variance, the model provides a flexible and interpretable means to capture complex, predictor-dependent relationships and heteroscedastic patterns in data. This dual-layer representation enables precise inference, even in high-dimensional settings, while maintaining computational feasibility through efficient Markov Chain Monte Carlo (MCMC) sampling and conjugate prior structures. We illustrate the model's capability through both simulated and real-world datasets, demonstrating its ability to capture nuanced variance structures, provide reliable predictive uncertainty quantification, and highlight key predictors influencing both the mean response and its variability. Empirical results show that the Heteroscedastic AddiVortes model offers a substantial improvement in capturing distributional properties compared to both homoscedastic and heteroscedastic alternatives, making it a robust tool for complex regression problems in various applied settings.


Heteroscedastic Double Bayesian Elastic Net

Kimura, Masanari

arXiv.org Machine Learning

In many practical applications, regression models are employed to uncover relationships between predictors and a response variable, yet the common assumption of constant error variance is frequently violated. This issue is further compounded in high-dimensional settings where the number of predictors exceeds the sample size, necessitating regularization for effective estimation and variable selection. To address this problem, we propose the Heteroscedastic Double Bayesian Elastic Net (HDBEN), a novel framework that jointly models the mean and log-variance using hierarchical Bayesian priors incorporating both $\ell_1$ and $\ell_2$ penalties. Our approach simultaneously induces sparsity and grouping in the regression coefficients and variance parameters, capturing complex variance structures in the data. Theoretical results demonstrate that proposed HDBEN achieves posterior concentration, variable selection consistency, and asymptotic normality under mild conditions which justifying its behavior. Simulation studies further illustrate that HDBEN outperforms existing methods, particularly in scenarios characterized by heteroscedasticity and high dimensionality.


Advancing sleep detection by modelling weak label sets: A novel weakly supervised learning approach

Boeker, Matthias, Thambawita, Vajira, Riegler, Michael, Halvorsen, Pål, Hammer, Hugo L.

arXiv.org Artificial Intelligence

Understanding sleep and activity patterns plays a crucial role in physical and mental health. This study introduces a novel approach for sleep detection using weakly supervised learning for scenarios where reliable ground truth labels are unavailable. The proposed method relies on a set of weak labels, derived from the predictions generated by conventional sleep detection algorithms. Introducing a novel approach, we suggest a novel generalised non-linear statistical model in which the number of weak sleep labels is modelled as outcome of a binomial distribution. The probability of sleep in the binomial distribution is linked to the outcomes of neural networks trained to detect sleep based on actigraphy. We show that maximizing the likelihood function of the model, is equivalent to minimizing the soft cross-entropy loss. Additionally, we explored the use of the Brier score as a loss function for weak labels. The efficacy of the suggested modelling framework was demonstrated using the Multi-Ethnic Study of Atherosclerosis dataset. A \gls{lstm} trained on the soft cross-entropy outperformed conventional sleep detection algorithms, other neural network architectures and loss functions in accuracy and model calibration. This research not only advances sleep detection techniques in scenarios where ground truth data is scarce but also contributes to the broader field of weakly supervised learning by introducing innovative approach in modelling sets of weak labels.


Statistical Agnostic Regression: a machine learning method to validate regression models

Gorriz, Juan M, Ramirez, J., Segovia, F., Martinez-Murcia, F. J., Jiménez-Mesa, C., Suckling, J.

arXiv.org Machine Learning

Regression analysis is a central topic in statistical modeling, aiming to estimate the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in several fields of research, such as prediction, forecasting, or causal inference. Beyond various classical methods to solve linear regression problems, such as Ordinary Least Squares, Ridge, or Lasso regressions - which are often the foundation for more advanced machine learning (ML) techniques - the latter have been successfully applied in this scenario without a formal definition of statistical significance. At most, permutation or classical analyses based on empirical measures (e.g., residuals or accuracy) have been conducted to reflect the greater ability of ML estimations for detection. In this paper, we introduce a method, named Statistical Agnostic Regression (SAR), for evaluating the statistical significance of an ML-based linear regression based on concentration inequalities of the actual risk using the analysis of the worst case. To achieve this goal, similar to the classification problem, we define a threshold to establish that there is sufficient evidence with a probability of at least 1-eta to conclude that there is a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations in only two dimensions demonstrate the ability of the proposed agnostic test to provide a similar analysis of variance given by the classical $F$ test for the slope parameter.