Generalized Principal Component Analysis

Townes, F. William

arXiv.org Machine Learning 

Principal component analysis (PCA) [1] is widely used to reduce the dimensionality of large datasets. However, it implicitly optimizes an objective function that is equivalent to a Gaussian likelihood. Hence, for data such as nonnegative, discrete counts that do not follow the normal distribution, PCA may be inappropriate. A motivating example of count data comes from single cell gene expression profiling (scRNA-Seq), where each observation represents a cell and genes are features. Such data are often highly sparse (>90% zeros) and exhibit skewed distributions poorly matched by Gaussian noise. To remedy this, Collins [2] proposed generalizing PCA to the exponential family in a manner analogous to the generalization of linear regression to generalized linear models. Here, we provide a detailed derivation of generalized PCA (GLM-PCA) with a focus on optimization using Fisher scoring. We also expand on Collins' model by incorporating covariates, and propose post hoc transformations to enhance interpretability of latent factors.
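To make the idea concrete, the following is a minimal illustrative sketch (not the paper's implementation) of GLM-PCA for Poisson counts with a log link: the mean is modeled as E[Y] = exp(U Vᵀ), and the factors U and loadings V are fit by alternating Fisher scoring updates, where the expected information for a row factor is Vᵀ diag(μᵢ) V. The function name, damping step, ridge term, and clipping are assumptions added for numerical stability, not details from the abstract.

```python
import numpy as np

def glmpca_poisson(Y, L=2, n_iter=50, step=0.5, seed=0):
    """Illustrative Poisson GLM-PCA sketch: model E[Y] = exp(U @ V.T) and fit
    U, V by alternating damped Fisher scoring updates (log link)."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    U = 1e-2 * rng.standard_normal((n, L))  # cell factors
    V = 1e-2 * rng.standard_normal((p, L))  # gene loadings
    ridge = 1e-6 * np.eye(L)  # small ridge keeps the information invertible
    for _ in range(n_iter):
        # clip the linear predictor to avoid overflow in exp (a safeguard)
        Mu = np.exp(np.clip(U @ V.T, -30, 30))
        for i in range(n):  # Fisher scoring step for each cell's factors
            g = V.T @ (Y[i] - Mu[i])                  # score vector
            info = (V * Mu[i][:, None]).T @ V + ridge # expected information
            U[i] += step * np.linalg.solve(info, g)
        Mu = np.exp(np.clip(U @ V.T, -30, 30))
        for j in range(p):  # same update for each gene's loadings
            g = U.T @ (Y[:, j] - Mu[:, j])
            info = (U * Mu[:, j][:, None]).T @ U + ridge
            V[j] += step * np.linalg.solve(info, g)
    return U, V
```

In this sketch the damped step (`step=0.5`) trades speed for stability, since full Fisher scoring steps under the log link can overshoot early in the fit; covariates and the post hoc rotations mentioned above are omitted for brevity.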