Continuum directions for supervised dimension reduction

Jung, Sungkyu

arXiv.org Machine Learning 

Dimension reduction of multivariate data supervised by auxiliary information is considered. A series of basis vectors for dimension reduction is obtained as minimizers of a novel criterion. The proposed method is akin to continuum regression, and the resulting basis vectors are called continuum directions. In the presence of binary supervision data, these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. High-dimensional asymptotic studies of continuum directions for binary supervision reveal several interesting facts. The conditions under which the sample continuum directions are inconsistent, but their classification performance is good, are specified. While the proposed method can be directly used for binary and multi-category classification, its generalizations to incorporate any form of auxiliary data are also presented. The proposed method enjoys fast computation, and its performance is better than or on par with more computer-intensive alternatives.

Keywords: continuum regression, dimension reduction, linear discriminant analysis, high-dimension, low-sample-size (HDLSS), maximum data piling, principal component analysis

2000 MSC: 60K35

1. Introduction

In modern complex data, it is increasingly common that multiple data sets are available. Two types of data are collected on the same set of subjects: a data set of primary interest X and an auxiliary data set Y. The goal of supervised dimension reduction is to delineate the major signals in X that depend on Y. Relevant application areas include genomics (genetic studies collect both gene expression and SNP data; Li et al., 2016), finance (stocks as X in relation to characteristics Y of each stock: size, value, momentum and volatility; Connor et al., 2012), and batch effect adjustments (Lee et al., 2014).

A number of works deal with the multi-source data situation. Lock et al. (2013) developed JIVE to separate joint variation from individual variations. Large-scale correlation studies can identify millions of pairwise associations between two data sets via multiple canonical correlation analysis (Witten and Tibshirani, 2009). These methods, however, do not provide supervised dimension reduction of a particular data set X, since all data sets assume an equal role. In contrast, reduced-rank regression (RRR; Izenman, 1975; Tso, 1981) and envelope models (Cook et al., 2010) provide sufficient dimension reduction (Cook and Ni, 2005) for regression problems. See Cook et al. (2013) for connections between envelopes and partial least squares regression.
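
As a concrete reference point for the three classical directions that the abstract says continuum directions bridge under binary supervision, the following is a minimal sketch of how the principal component, mean difference and linear discriminant directions might be computed. It does not implement the continuum criterion itself, which is not stated in this excerpt. The function name `endpoint_directions` is hypothetical, and the sketch assumes more samples than variables so that the pooled within-class covariance is invertible (this is not the HDLSS setting).

```python
import numpy as np


def endpoint_directions(X, y):
    """Return unit-length PC, mean-difference and LDA directions.

    Hypothetical helper for illustration only.
    X: (n, p) data matrix; assumes n > p so the pooled covariance is invertible.
    y: (n,) array of binary labels in {0, 1}.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # First principal component: leading right singular vector of centered X.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc = Vt[0]

    # Mean difference direction: difference of the two class means.
    X0, X1 = X[y == 0], X[y == 1]
    d = X1.mean(axis=0) - X0.mean(axis=0)
    md = d / np.linalg.norm(d)

    # Linear discriminant direction: pooled within-class covariance
    # applied inversely to the mean difference.
    R0 = X0 - X0.mean(axis=0)
    R1 = X1 - X1.mean(axis=0)
    Sw = (R0.T @ R0 + R1.T @ R1) / (X.shape[0] - 2)
    w = np.linalg.solve(Sw, d)
    lda = w / np.linalg.norm(w)

    return pc, md, lda


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = np.repeat([0, 1], 30)
    X[y == 1, 0] += 2.0  # shift class 1 along the first coordinate
    pc, md, lda = endpoint_directions(X, y)
    print("PC :", np.round(pc, 3))
    print("MD :", np.round(md, 3))
    print("LDA:", np.round(lda, 3))
```

The principal component direction ignores the labels entirely, while the linear discriminant direction uses them fully, which is the unsupervised-to-supervised range described in the abstract.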
