Principal Component Analysis
Improving the Accuracy of Principal Component Analysis by the Maximum Entropy Method
Wan, Guihong, Maung, Crystal, Schweitzer, Haim
Classical Principal Component Analysis (PCA) approximates data in terms of projections on a small number of orthogonal vectors. There are simple procedures to efficiently compute various functions of the data from the PCA approximation. The most important such function is arguably the Euclidean distance between data items, which can be used, for example, to solve the approximate nearest neighbor problem. We use random variables to model the inherent uncertainty in such approximations, and apply the Maximum Entropy Method to infer the underlying probability distribution. We propose using the expected values of distances between these random variables as improved estimates of the distance. We show analytically and experimentally that in most cases results obtained by our method are more accurate than those obtained by the classical approach. This improves the accuracy of a classical technique that has been used with little change for over 100 years.
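As a point of reference, here is a minimal sketch of the classical baseline the paper improves on: estimating pairwise Euclidean distances from a low-dimensional PCA projection. The use of scikit-learn, the toy data, and the choice of k are assumptions for illustration; the maximum-entropy correction itself is not implemented here.

```python
# Sketch: approximating pairwise Euclidean distances from a PCA projection.
# This illustrates the classical baseline described above, not the paper's
# maximum-entropy correction. scikit-learn and the toy data are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))            # 500 items, 50 features (toy data)

k = 5                                     # number of principal components kept
Z = PCA(n_components=k).fit_transform(X)  # projections onto the top-k subspace

exact = pairwise_distances(X)             # true Euclidean distances
approx = pairwise_distances(Z)            # distances estimated from projections

# The projected distance never exceeds the true one; the paper's method aims
# to reduce this systematic underestimation.
print("mean underestimation:", np.mean(exact - approx))
```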
Exact Recovery of Tensor Robust Principal Component Analysis under Linear Transforms
This work studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is motivated by the recently proposed linear-transform-based tensor-tensor product and tensor SVD. We define a new transform-dependent tensor rank and the corresponding tensor nuclear norm. Then we solve the TRPCA problem by a convex optimization whose objective is a weighted combination of the new tensor nuclear norm and the $\ell_1$-norm. In theory, we show that under certain incoherence conditions, the convex program exactly recovers the underlying low-rank and sparse components. It is of great interest that our new TRPCA model generalizes existing works. In particular, if the studied tensor reduces to a matrix, our TRPCA model reduces to the known matrix RPCA. Our new TRPCA model, which allows general linear transforms, can be regarded as an extension of our earlier TRPCA work based on the discrete Fourier transform, although the proofs of the recovery guarantees differ. Numerical experiments verify our results and the application to image recovery demonstrates the superiority of our method.
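In symbols, the convex program described above can be sketched as $\min_{\mathcal{L},\,\mathcal{E}} \|\mathcal{L}\|_{\circledast} + \lambda \|\mathcal{E}\|_1$ subject to $\mathcal{X} = \mathcal{L} + \mathcal{E}$, where $\mathcal{X}$ is the observed tensor, $\|\cdot\|_{\circledast}$ denotes the tensor nuclear norm induced by the chosen linear transform, and $\lambda > 0$ weights the sparse component. The notation here is a generic sketch, not necessarily the paper's exact symbols.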
All Sparse PCA Models Are Wrong, But Some Are Useful. Part I: Computation of Scores, Residuals and Explained Variance
Camacho, J., Smilde, A. K., Saccenti, E., Westerhuis, J. A.
Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and explained variance used in classical PCA cannot be applied directly to sPCA models. This also affects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-of-the-art sPCA algorithms, and provide the proper computation of scores, residuals and explained variance. We show that sPCA approaches give disparate and limited performance when modeling noise-free, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this problem.
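The following toy sketch illustrates the issue numerically: with non-orthogonal loadings, the classical score shortcut $T = XP$ can misstate the explained variance, whereas least-squares (oblique) scores cannot do worse. The data and loadings below are made up for illustration, and this is not necessarily the exact procedure recommended in the paper.

```python
# Sketch: with non-orthogonal (sparse) loadings P, the classical shortcut
# T = X @ P can mislead. One least-squares alternative is shown; it is an
# illustration, not the paper's exact recipe.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 8))
X -= X.mean(axis=0)

# Two sparse, deliberately non-orthogonal loading vectors (toy example).
P = np.array([[1.0, 0.0],
              [1.0, 0.5],
              [0.0, 1.0],
              [0.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 0.0]])

T_naive = X @ P                              # classical PCA-style scores
T_ls = X @ P @ np.linalg.inv(P.T @ P)        # least-squares (oblique) scores

E_naive = X - T_naive @ P.T                  # residuals from naive scores
E_ls = X - T_ls @ P.T                        # residuals from LS scores

def explained(E):
    return 1.0 - np.sum(E**2) / np.sum(X**2)

print("explained variance, naive scores:", explained(E_naive))
print("explained variance, LS scores   :", explained(E_ls))   # never worse
```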
Generalized Principal Component Analysis
Principal component analysis (PCA) [1] is widely used to reduce the dimensionality of large datasets. However, it implicitly optimizes an objective function that is equivalent to a Gaussian likelihood. Hence, for data such as nonnegative, discrete counts that do not follow the normal distribution, PCA may be inappropriate. A motivating example of count data comes from single cell gene expression profiling (scRNA-Seq) where each observation represents a cell and genes are features. Such data are often highly sparse (>90% zeros) and exhibit skewed distributions poorly matched by Gaussian noise. To remedy this, Collins [2] proposed generalizing PCA to the exponential family in a manner analogous to the generalization of linear regression to generalized linear models. Here, we provide a detailed derivation of generalized PCA (GLM-PCA) with a focus on optimization using Fisher scoring. We also expand on Collins' model by incorporating covariates, and propose post hoc transformations to enhance interpretability of latent factors.
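A minimal sketch of Poisson GLM-PCA is given below, fitted by plain alternating gradient ascent rather than the Fisher scoring used in the paper, and without covariates; the toy count data, learning rate, and iteration count are assumptions.

```python
# Sketch of Poisson GLM-PCA fitted by alternating gradient ascent. The paper
# uses Fisher scoring and adds covariates; this simplified variant and the
# toy data are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 200, 30, 2
U_true = rng.normal(scale=0.5, size=(n, k))
V_true = rng.normal(scale=0.5, size=(p, k))
Y = rng.poisson(np.exp(U_true @ V_true.T))       # toy count matrix

U = rng.normal(scale=0.01, size=(n, k))
V = rng.normal(scale=0.01, size=(p, k))
lr = 1e-3
for _ in range(2000):
    eta = np.clip(U @ V.T, -8.0, 8.0)            # linear predictor, clipped for safety
    R = Y - np.exp(eta)                          # data minus fitted means
    U += lr * R @ V                              # d(loglik)/dU = R @ V
    V += lr * R.T @ U                            # d(loglik)/dV = R.T @ U

Mu = np.exp(np.clip(U @ V.T, -8.0, 8.0))
Ysafe = np.where(Y > 0, Y, 1.0)                  # avoids log(0); term is 0 when Y == 0
deviance = 2.0 * np.sum(Y * np.log(Ysafe / Mu) - (Y - Mu))
print("Poisson deviance after fitting:", deviance)
```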
Robust Tensor Completion Using Transformed Tensor SVD
Song, Guangjing, Ng, Michael K., Zhang, Xiongjun
Tensors (multidimensional arrays) are generalizations of vectors and matrices, and can be used as a powerful tool for modeling multidimensional data such as videos [29], color images [36, 40], hyperspectral images [11, 35, 49], and electroencephalography (EEG) [8]. By exploiting its multilinear algebraic structure, a tensor representation can provide a better understanding of multidimensional data and higher accuracy in modeling it. In many tensor data applications [9, 20, 27, 33, 37, 40, 41, 47, 50], tensor data sets are often corrupted and/or incomplete owing to various unpredictable or unavoidable situations. This motivates us to perform tensor completion and tensor robust principal component analysis for multidimensional data processing. Compared with matrix completion and robust principal component analysis, tensor completion and tensor robust principal component analysis are far from being well studied. The main issues are the definitions of tensor ranks and tensor decompositions.
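For concreteness, here is a minimal sketch of a transformed tensor SVD truncation with the DFT along the third mode as the transform (the classical t-SVD choice; the paper's point is to allow more general transforms). The toy tensor and the truncation rank are assumptions.

```python
# Sketch: low-tubal-rank approximation via a transformed tensor SVD with the
# DFT along mode 3. The toy tensor and rank r are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 20, 8))           # a third-order tensor
r = 3                                      # target tubal rank

Ahat = np.fft.fft(A, axis=2)               # move to the transform domain
Lhat = np.empty_like(Ahat)
for i in range(A.shape[2]):                # truncated SVD of each frontal slice
    U, s, Vt = np.linalg.svd(Ahat[:, :, i], full_matrices=False)
    Lhat[:, :, i] = (U[:, :r] * s[:r]) @ Vt[:r, :]
L = np.real(np.fft.ifft(Lhat, axis=2))     # back-transform: rank-r tubal approx

print("relative error:", np.linalg.norm(L - A) / np.linalg.norm(A))
```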
Cross-product Penalized Component Analysis (XCAN)
Camacho, José, Acar, Evrim, Rasmussen, Morten A., Bro, Rasmus
Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA to constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines.
Principal Component Analysis for Multivariate Extremes
The first-order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high-dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower-dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite-sample uniform guarantees on the reconstruction error of the estimated projection subspace. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.
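The pipeline described above can be sketched in a few lines: keep the observations whose radius exceeds a large threshold, rescale the exceedances to the unit sphere, and run PCA on the result. The heavy-tailed toy data and the 95% threshold below are assumptions for illustration.

```python
# Sketch: PCA on rescaled radial exceedances. The toy heavy-tailed data,
# the 95% radial threshold, and rescaling to the unit sphere are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
n, d = 5000, 10
# Heavy-tailed samples whose extremes concentrate near a 2-D subspace (toy).
B = rng.normal(size=(d, 2))
X = (rng.pareto(2.0, size=(n, 2)) + 1.0) @ B.T + rng.normal(size=(n, d))

radius = np.linalg.norm(X, axis=1)
threshold = np.quantile(radius, 0.95)             # large radial threshold
exceed = X[radius > threshold]                    # radial exceedances
angles = exceed / np.linalg.norm(exceed, axis=1, keepdims=True)  # unit sphere

pca = PCA(n_components=2).fit(angles)
print("variance captured by 2 components:",
      pca.explained_variance_ratio_.sum())
```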
The Components of Principal Component Analysis: A Python Tutorial (Math Misery?)
I recently ran a data science training course on the topic of principal component analysis and dimension reduction. The course was less about the intimate mathematical details and more about understanding the various outputs that are available when running PCA. In other words, my goal was to make sure that followers of this tutorial can see what terms like "explained_variance_", "explained_variance_ratio_" and "components_" mean when they probe the PCA object. It shouldn't be a mystery, and it should be something that anyone can recreate "by hand". My training sessions tend to be fluid and no one session is the same as any other.
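A short sketch in the spirit of the course material: recreating scikit-learn's explained_variance_, explained_variance_ratio_ and components_ by hand from the SVD of the centered data. The toy data is an assumption.

```python
# Recreating PCA's explained_variance_, explained_variance_ratio_ and
# components_ "by hand" from the SVD of the centered data (toy example).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

pca = PCA().fit(X)

Xc = X - X.mean(axis=0)                      # PCA centers the data internally
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance = s**2 / (X.shape[0] - 1)            # per-component variance
explained_variance_ratio = explained_variance / explained_variance.sum()

print(np.allclose(pca.explained_variance_, explained_variance))            # True
print(np.allclose(pca.explained_variance_ratio_, explained_variance_ratio))# True
print(np.allclose(np.abs(pca.components_), np.abs(Vt)))  # rows match up to sign
```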
Unsupervised and Supervised Principal Component Analysis: Tutorial
Ghojogh, Benyamin, Crowley, Mark
This is a detailed tutorial paper which explains the Principal Component Analysis (PCA), Supervised PCA (SPCA), kernel PCA, and kernel SPCA. We start with projection, PCA with eigen-decomposition, PCA with one and multiple projection directions, properties of the projection matrix, reconstruction error minimization, and the connection to auto-encoders. Then, PCA with singular value decomposition, dual PCA, and kernel PCA are covered. SPCA using both scoring and the Hilbert-Schmidt independence criterion is explained. Kernel SPCA using both direct and dual approaches is then introduced. We cover all cases of projection and reconstruction of training and out-of-sample data. Finally, some simulations are provided on the Frey and AT&T face datasets to verify the theory in practice.
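As a small numerical companion to the tutorial, the snippet below checks one of the equivalences it walks through: the principal directions from the eigendecomposition of the sample covariance matrix coincide, up to sign, with the right singular vectors of the centered data. The toy data is an assumption.

```python
# Numerical check: eigenvectors of the sample covariance matrix equal the
# right singular vectors of the centered data (up to sign). Toy data assumed.
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
C = Xc.T @ Xc / (X.shape[0] - 1)
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]              # sort by decreasing eigenvalue
evecs = evecs[:, order]

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(np.abs(evecs), np.abs(Vt.T)))           # True (signs may differ)
print(np.allclose(evals[order], s**2 / (X.shape[0] - 1))) # eigenvalues match
```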
Adaptive probabilistic principal component analysis
Farooq, Adam, Raykov, Yordan P., Evers, Luc, Little, Max A.
Using the linear Gaussian latent variable model as a starting point, we relax some of the constraints it imposes by deriving a nonparametric latent feature Gaussian variable model. This model introduces additional discrete latent variables to the original structure. The Bayesian nonparametric nature of the new model allows it to adapt its complexity as more data are observed and to project each data point onto a varying number of subspaces. The linear relationship between the continuous latent and observed variables makes the proposed model straightforward to interpret, resembling a locally adaptive probabilistic PCA (A-PPCA). We propose two alternative Gibbs sampling procedures for inference in the new model and demonstrate its applicability on sensor data for passive health monitoring.
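For context, here is a minimal sketch of the classical probabilistic PCA model the paper takes as its starting point, using the closed-form maximum-likelihood solution of Tipping and Bishop; it does not implement the paper's nonparametric extension or its Gibbs samplers, and the toy data is an assumption.

```python
# Sketch of classical probabilistic PCA (the linear Gaussian latent variable
# model the paper starts from), via the closed-form ML solution of
# Tipping & Bishop. The adaptive, nonparametric model is NOT implemented here.
import numpy as np

rng = np.random.default_rng(7)
n, d, q = 500, 10, 2
W_true = rng.normal(size=(d, q))
X = rng.normal(size=(n, q)) @ W_true.T + 0.3 * rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n                                # sample covariance
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]       # descending order

sigma2 = evals[q:].mean()                        # ML noise variance
W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)   # ML loading matrix (R = I)

# Posterior expectation of the latent variables given the data.
M = W.T @ W + sigma2 * np.eye(q)
Z = Xc @ W @ np.linalg.inv(M)
print("estimated noise variance:", sigma2)
```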