Lee, Hwiyoung
Graph Canonical Correlation Analysis
Park, Hongju, Bai, Shuyang, Ye, Zhenyao, Lee, Hwiyoung, Ma, Tianzhou, Chen, Shuo
CCA considers the following maximization problem: max_{a,b} a^⊤ Σ_XY b subject to a^⊤ Σ_XX a = 1 and b^⊤ Σ_YY b = 1, where the vectors a and b and the resulting correlation are called the canonical vectors and the canonical correlation if they attain the maximum. In classical canonical correlation analysis, the canonical vectors a and b include nonzero loadings for all X and Y variables. However, in a high-dimensional setting with p, q ≫ n, the goal is to identify which subsets of X are associated with which subsets of Y and to estimate the strength of these associations, because the canonical correlation computed from the full dataset is inflated by estimation bias caused by overfitting. To ensure sparsity, shrinkage methods are commonly used. For example, Witten et al. (2009) propose sparse canonical correlation analysis (sCCA). The sCCA criterion can in general be expressed as follows: max_{a,b} a^⊤ Σ_XY b subject to a^⊤ Σ_XX a ≤ 1, b^⊤ Σ_YY b ≤ 1, P_1(a) ≤ k_1, and P_2(b) ≤ k_2, where P_1 and P_2 are convex penalty functions for a and b with positive constants k_1 and k_2, respectively. A representative choice is the ℓ1 penalty, with P_1(a) = ‖a‖_1 and P_2(b) = ‖b‖_1. sCCA imposes zero loadings in the canonical vectors and thus selects only subsets of correlated X and Y variables. However, sCCA methods may neither fully recover the correlated X and Y pairs nor capture the multivariate-to-multivariate linkage patterns (see Figure 3), because the ℓ1 shrinkage tends to select only a small subset of the associated X and Y variables.
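As a rough illustration of the sCCA criterion above, the following sketch runs alternating soft-thresholded power iterations on the sample cross-covariance matrix, in the spirit of Witten et al. (2009) (which approximates Σ_XX and Σ_YY by identity matrices, so the quadratic constraints reduce to unit ℓ2 norms). The relative threshold `lam` is a simplification standing in for the constraints P_1(a) ≤ k_1 and P_2(b) ≤ k_2; all function names here are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_cca(X, Y, lam=0.3, n_iter=50):
    """Sketch of one sparse canonical pair via alternating updates.

    lam in (0, 1) is a *relative* l1 threshold (fraction of the largest
    coefficient), a simplification of the constraints P1(a) <= k1,
    P2(b) <= k2 in the sCCA criterion.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / n                      # sample cross-covariance Sigma_XY
    b = np.linalg.svd(C)[2][0]             # init: leading right singular vector
    for _ in range(n_iter):
        c = C @ b                          # update a: threshold, then renormalize
        a = soft_threshold(c, lam * np.abs(c).max())
        a /= np.linalg.norm(a)
        d = C.T @ a                        # update b symmetrically
        b = soft_threshold(d, lam * np.abs(d).max())
        b /= np.linalg.norm(b)
    return a, b, a @ C @ b                 # sparse loadings and objective value

# Toy data: only the first two columns of X and Y share a latent signal.
rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)
X = rng.standard_normal((n, 10))
Y = rng.standard_normal((n, 10))
X[:, :2] += z[:, None]
Y[:, :2] += z[:, None]
a, b, rho = sparse_cca(X, Y)
```

On this toy example the thresholding zeroes out the loadings of the eight noise columns in each block, illustrating how sCCA selects only the associated subsets of X and Y.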
A Systematic Bias of Machine Learning Regression Models and Its Correction: an Application to Imaging-based Brain Age Prediction
Lee, Hwiyoung, Chen, Shuo
Machine learning models for continuous outcomes often yield systematically biased predictions, particularly for values that deviate substantially from the mean. Specifically, predictions for large-valued outcomes tend to be negatively biased, while those for small-valued outcomes are positively biased. We refer to this linear, central-tendency-warped bias as the "systematic bias of machine learning regression". In this paper, we first demonstrate that this issue persists across various machine learning models, and then delve into its theoretical underpinnings. We propose a general constrained optimization approach designed to correct this bias and develop a computationally efficient algorithm to implement our method. Our simulation results indicate that our correction method effectively eliminates the bias from the predicted outcomes. We apply the proposed approach to the prediction of brain age using neuroimaging data. In comparison to competing machine learning models, our method effectively addresses the longstanding issue of "systematic bias of machine learning regression" in neuroimaging-based brain age calculation, yielding unbiased predictions of brain age.
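The bias pattern described above can be reproduced with any shrinkage-based learner. The sketch below uses ridge regression on simulated data: regressing predictions on true outcomes gives a slope below 1 (large outcomes underpredicted, small outcomes overpredicted). It then applies a simple linear recalibration fitted on the training predictions, as commonly done in the brain-age literature; this is an illustration of the phenomenon and of one standard fix, not the constrained optimization method proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.standard_normal((n, p))
coef = rng.standard_normal(p) * 0.2
y = X @ coef + rng.standard_normal(n)      # noisy continuous outcome

# Train/test split.
Xtr, Xte = X[:200], X[200:]
ytr, yte = y[:200], y[200:]

# Ridge regression: the shrinkage pulls predictions toward the mean,
# producing the "systematic bias of machine learning regression".
lam = 50.0
W = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)
pred_tr = Xtr @ W
pred_te = Xte @ W

# Slope of predictions vs. truth is < 1: large y is underpredicted,
# small y is overpredicted.
slope_raw = np.polyfit(yte, pred_te, 1)[0]

# Linear bias correction fitted on training predictions
# (recalibration in the style of the brain-age literature):
# fit pred ~ alpha * y + beta0, then invert the fitted line.
alpha, beta0 = np.polyfit(ytr, pred_tr, 1)
pred_corr = (pred_te - beta0) / alpha
slope_corr = np.polyfit(yte, pred_corr, 1)[0]
```

After recalibration, the prediction-versus-truth slope moves much closer to 1, removing the linear central-tendency warp on held-out data; the paper's constrained optimization approach targets the same bias within the model fit itself.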