Principal component analysis (PCA) is recognised as a quintessential data analysis technique when it comes to describing linear relationships between the features of a dataset. However, the well-known sensitivity of PCA to non-Gaussian samples and/or outliers often makes it unreliable in practice. To address this, a robust formulation of PCA is derived based on the maximum correntropy criterion (MCC) so as to maximise the expected likelihood of Gaussian distributed reconstruction errors. In this way, the proposed solution reduces to a generalised power iteration, whereby: (i) robust estimates of the principal components are obtained even in the presence of outliers; (ii) the number of principal components need not be specified in advance; and (iii) the entire set of principal components can be obtained, unlike existing approaches. The advantages of the proposed maximum correntropy power iteration (MCPI) are demonstrated through an intuitive numerical example.
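As a rough illustration of the kind of update described above, the following sketch implements a correntropy-weighted power iteration for a single robust direction; the Gaussian kernel width `sigma`, the stopping rule, and the function name are illustrative assumptions, not the authors' exact MCPI.

```python
import numpy as np

def mcc_power_iteration(X, sigma=1.0, n_iter=50):
    """Illustrative correntropy-weighted power iteration for one robust
    principal direction (a sketch, not the authors' implementation)."""
    X = X - X.mean(axis=0)                      # centre the data
    w = np.random.default_rng(0).standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        resid = X - np.outer(X @ w, w)          # reconstruction errors
        # Gaussian correntropy weights: large errors are down-weighted
        a = np.exp(-np.sum(resid**2, axis=1) / (2 * sigma**2))
        C = (X * a[:, None]).T @ X              # re-weighted covariance
        w_new = C @ w
        w_new /= np.linalg.norm(w_new)
        converged = np.abs(w_new @ w) > 1 - 1e-10
        w = w_new
        if converged:
            break
    return w
```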
Despite the advances of deep learning in image-related tasks, the principled assessment of image fidelity and similarity remains a critical capability to develop. Since it has been shown that Mean Squared Error (MSE) is insufficient for this task, other measures have been developed, one of the most effective being the Structural Similarity Index (SSIM). Such measures can be used for subspace learning, but existing methods in machine learning, such as Principal Component Analysis (PCA), are based on Euclidean distance or MSE and thus cannot properly capture the structural features of images. In this paper, we define an image structure subspace which discriminates different types of image distortions. We propose Image Structural Component Analysis (ISCA) and also kernel ISCA by using SSIM, rather than Euclidean distance, in the formulation of PCA. This paper provides a bridge between image quality assessment and manifold learning, opening a broad new area for future research.
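To illustrate the gap between MSE and SSIM that motivates this work, here is a small sketch comparing the two measures on two distortions of a test image; the distortion strengths and the use of scikit-image are illustrative choices, not taken from the paper.

```python
import numpy as np
from skimage import data
from skimage.metrics import structural_similarity, mean_squared_error

img = data.camera().astype(float) / 255.0
rng = np.random.default_rng(0)

noisy = np.clip(img + 0.05 * rng.standard_normal(img.shape), 0, 1)
shifted = np.clip(img + 0.05, 0, 1)   # roughly the same nominal MSE as the noise above

# Two distortions of comparable MSE can receive very different SSIM scores
for name, distorted in [("noise", noisy), ("shift", shifted)]:
    mse = mean_squared_error(img, distorted)
    ssim = structural_similarity(img, distorted, data_range=1.0)
    print(name, mse, ssim)
```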
Classical Principal Component Analysis (PCA) approximates data in terms of projections on a small number of orthogonal vectors. There are simple procedures to efficiently compute various functions of the data from the PCA approximation. The most important function is arguably the Euclidean distance between data items. This can be used, for example, to solve the approximate nearest neighbor problem. We use random variables to model the inherent uncertainty in such approximations, and apply the Maximum Entropy Method to infer the underlying probability distribution. We propose using the expected values of distances between these random variables as improved estimates of the distance. We show analytically and experimentally that in most cases results obtained by our method are more accurate than what is obtained by the classical approach. This improves the accuracy of a classical technique that has been used with little change for over 100 years.
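For context, this is the classical PCA distance approximation the paper sets out to improve: distances are computed between the low-dimensional projections, which systematically underestimates the true Euclidean distance. The maximum-entropy expected-distance estimator itself is not reproduced here; the data and dimensions below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))             # data items as rows
Xc = X - X.mean(axis=0)

# Classical PCA approximation with k orthogonal directions
k = 5
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:k].T                              # k-dimensional projections

# Classical distance estimate: Euclidean distance between projections
i, j = 3, 7
d_approx = np.linalg.norm(P[i] - P[j])
d_true = np.linalg.norm(Xc[i] - Xc[j])
print(d_approx, d_true)                        # the projection distance never exceeds the true one
```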
This work studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is motivated by the recently proposed linear transforms based tensor-tensor product and tensor SVD. We define a new transform-dependent tensor rank and the corresponding tensor nuclear norm. Then we solve the TRPCA problem by convex optimization whose objective is a weighted combination of the new tensor nuclear norm and the $\ell_1$-norm. In theory, we show that under certain incoherence conditions, the convex program exactly recovers the underlying low-rank and sparse components. Notably, our new TRPCA model generalizes existing works. In particular, if the studied tensor reduces to a matrix, our TRPCA model reduces to the known matrix RPCA. Our new TRPCA, which allows general linear transforms, can be regarded as an extension of our former TRPCA work based on the discrete Fourier transform, although the proofs of the recovery guarantees differ. Numerical experiments verify our results, and the application to image recovery demonstrates the superiority of our method.
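The matrix special case mentioned above, matrix RPCA, can be sketched with a standard ADMM for principal component pursuit; the choices of `lam` and `mu` follow common practice and are assumptions, and the tensor model with general linear transforms is not shown.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vt

def shrink(A, tau):
    """Elementwise soft thresholding (proximal operator of the l1 norm)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0)

def matrix_rpca(M, n_iter=500, tol=1e-7):
    """Illustrative ADMM for min ||L||_* + lam * ||S||_1  s.t.  M = L + S
    (the matrix special case only; not the paper's tensor algorithm)."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2))            # common default weight
    mu = n1 * n2 / (4.0 * np.abs(M).sum())      # common default penalty
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        R = M - L - S
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(M):
            break
    return L, S
```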
Principal component analysis (PCA) is widely used to reduce the dimensionality of large datasets. However, it implicitly optimizes an objective function that is equivalent to a Gaussian likelihood. Hence, for data such as nonnegative, discrete counts that do not follow the normal distribution, PCA may be inappropriate. A motivating example of count data comes from single cell gene expression profiling (scRNA-Seq), where each observation represents a cell and genes are features. Such data are often highly sparse (90% zeros) and exhibit skewed distributions poorly matched by Gaussian noise. To remedy this, Collins proposed generalizing PCA to the exponential family in a manner analogous to the generalization of linear regression to generalized linear models. Here, we provide a detailed derivation of generalized PCA (GLM-PCA) with a focus on optimization using Fisher scoring. We also expand on Collins' model by incorporating covariates, and propose post hoc transformations to enhance interpretability of latent factors.
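A minimal sketch of the idea, assuming a Poisson likelihood with a log link and plain alternating gradient ascent in place of the Fisher scoring derived in the paper; covariates and the post hoc transformations are omitted, and the step size and toy data are placeholders.

```python
import numpy as np

def poisson_glm_pca(Y, k=2, n_iter=200, lr=1e-3, seed=0):
    """Toy Poisson GLM-PCA: fit Y ~ Poisson(exp(U @ V.T)) by alternating
    gradient ascent on the log-likelihood (a sketch, not the paper's
    Fisher-scoring algorithm)."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    U = 1e-2 * rng.standard_normal((n, k))   # row (cell) factors
    V = 1e-2 * rng.standard_normal((p, k))   # column (gene) loadings
    for _ in range(n_iter):
        Mu = np.exp(U @ V.T)                 # Poisson means via the log link
        R = Y - Mu                           # score: gradient of log-lik wrt the linear predictor
        U += lr * (R @ V)
        V += lr * (R.T @ U)
    return U, V

Y = np.random.default_rng(1).poisson(1.0, size=(100, 20))   # sparse toy counts
U, V = poisson_glm_pca(Y)
```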
The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much more likely to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite-sample uniform guarantees on the reconstruction error pertaining to the estimated projection subspace. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.
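A minimal numerical sketch of the procedure, assuming that "re-scaled" means projecting the radial exceedances onto the unit sphere and that the projection subspace is estimated by an uncentred SVD; the heavy-tailed example, threshold quantile, and subspace dimension are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Heavy-tailed sample: Student-t with 2 degrees of freedom in 10 dimensions
X = rng.standard_t(df=2, size=(5000, 10))

# Keep exceedances above a high radial threshold, re-scale them to the unit sphere
r = np.linalg.norm(X, axis=1)
t = np.quantile(r, 0.95)                    # radial threshold (illustrative choice)
Theta = X[r > t] / r[r > t, None]           # angular parts of the exceedances

# PCA of the re-scaled exceedances: the k directions minimizing the
# empirical squared reconstruction error over projection subspaces
k = 2
_, _, Vt = np.linalg.svd(Theta, full_matrices=False)
V = Vt[:k].T                                # estimated projection subspace
risk = np.mean(np.sum((Theta - Theta @ V @ V.T) ** 2, axis=1))
print(risk)                                 # empirical reconstruction risk
```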
I recently ran a data science training course on the topic of principal component analysis and dimension reduction. This course was less about the intimate mathematical details and more about understanding the various outputs that are available when running PCA. In other words, my goal was to make sure that followers of this tutorial can see what terms like "explained_variance_" and "explained_variance_ratio_" and "components_" mean when they probe the PCA object. It shouldn't be a mystery and it should be something that anyone can recreate "by hand". My training sessions tend to be fluid and no one session is the same as any other.
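A sketch of the kind of by-hand reconstruction the course aims at, assuming scikit-learn's `PCA`: the attributes `explained_variance_`, `explained_variance_ratio_` and `components_` are recovered from an eigendecomposition of the sample covariance (the eigenvectors match only up to sign).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))

pca = PCA(n_components=3).fit(X)

# Recreate the same quantities "by hand" from the covariance eigendecomposition
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance
eigval, eigvec = np.linalg.eigh(cov)
order = np.argsort(eigval)[::-1]             # sort eigenvalues in decreasing order
eigval, eigvec = eigval[order], eigvec[:, order]

print(np.allclose(pca.explained_variance_, eigval[:3]))
print(np.allclose(pca.explained_variance_ratio_, eigval[:3] / eigval.sum()))
# components_ rows match the eigenvectors up to sign
print(np.allclose(np.abs(pca.components_), np.abs(eigvec[:, :3].T)))
```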
Using the linear Gaussian latent variable model as a starting point, we relax some of the constraints it imposes by deriving a nonparametric latent feature Gaussian variable model. This model introduces additional discrete latent variables to the original structure. The Bayesian nonparametric nature of this new model allows it to adapt complexity as more data is observed and project each data point onto a varying number of subspaces. The linear relationship between the continuous latent and observed variables makes the proposed model straightforward to interpret, resembling a locally adaptive probabilistic PCA (A-PPCA). We propose two alternative Gibbs sampling procedures for inference in the new model and demonstrate its applicability on sensor data for passive health monitoring.
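For reference, a minimal sketch of the linear Gaussian latent variable model (probabilistic PCA) used as the starting point, with the closed-form maximum-likelihood solution of Tipping and Bishop; the nonparametric extension with discrete latent variables and its Gibbs samplers are not reproduced, and the dimensions below are placeholders.

```python
import numpy as np

# Linear Gaussian latent variable model: x = W z + eps, z ~ N(0, I), eps ~ N(0, s2 I)
rng = np.random.default_rng(0)
n, d, q = 1000, 6, 2
W_true = rng.standard_normal((d, q))
Z = rng.standard_normal((n, q))
X = Z @ W_true.T + 0.1 * rng.standard_normal((n, d))

# Maximum-likelihood PPCA: eigendecompose the sample covariance and
# estimate the noise variance from the discarded eigenvalues
S = np.cov(X, rowvar=False)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]
s2 = lam[q:].mean()                              # noise variance estimate
W_ml = U[:, :q] @ np.sqrt(np.diag(lam[:q] - s2)) # ML loadings (up to rotation)
print(s2)
```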
Analyzing large data sets comes with multiple challenges. One of the challenges is to get data in the right structure for the analysis. Without preprocessing the data, your algorithms might have a difficult time converging and/or take a long time to execute. One of the techniques that we used at TCinc is Principal Component Analysis (PCA). The official definition of PCA from Wikipedia is "Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components."
In many real-world applications such as text categorization and face recognition, the dimensions of data are usually very high. Dealing with high-dimensional data is computationally expensive, while noise or outliers in the data can increase dramatically as the dimension increases. Dimension reduction is one of the most important and effective methods to handle high dimensional data [4, 17, 20]. Among the dimension reduction methods, Principal Component Analysis (PCA) is one of the most widely used methods due to its simplicity and effectiveness. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated principal directions. Usually the number of principal directions is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal direction has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding direction has the highest variance under the constraint that it is orthogonal to the preceding directions. The resulting vectors are an uncorrelated orthogonal basis set. When data points lie in a low-dimensional manifold and the manifold is linear or nearly-linear, the low-dimensional structure of data can be effectively captured by a linear subspace spanned by the principal PCA directions.
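A small numerical check of the properties listed above, assuming a plain covariance eigendecomposition: the principal directions are orthogonal, the variances along them are sorted in decreasing order, and the resulting component scores are uncorrelated. The toy data below are a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))  # correlated features
Xc = X - X.mean(axis=0)

# Principal directions from the eigendecomposition of the sample covariance
cov = np.cov(Xc, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]       # sort by decreasing variance

scores = Xc @ eigvec                                 # projections onto the directions
print(np.allclose(eigvec.T @ eigvec, np.eye(4)))     # orthogonal basis
print(np.all(np.diff(eigval) <= 0))                  # variances in decreasing order
print(np.allclose(np.cov(scores, rowvar=False),      # scores are uncorrelated
                  np.diag(eigval)))
```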