Goto

Collaborating Authors

 Principal Component Analysis


The Sparse Reverse of Principal Component Analysis for Fast Low-Rank Matrix Completion

arXiv.org Machine Learning

Matrix completion constantly receives tremendous attention from many research fields. It is commonly applied for recommender systems such as movie ratings, computer vision such as image reconstruction or completion, multi-task learning such as collaboratively modeling time-series trends of multiple sensors, and many other applications. Matrix completion techniques are usually computationally exhaustive and/or fail to capture the heterogeneity in the data. For example, images usually contain a heterogeneous set of objects, and thus it is a challenging task to reconstruct images with high levels of missing data. In this paper, we propose the sparse reverse of principal component analysis for matrix completion. The proposed approach maintains smoothness across the matrix, produces accurate estimates of the missing data, converges iteratively, and it is computationally tractable with a controllable upper bound on the number of iterations until convergence. The accuracy of the proposed technique is validated on natural images, movie ratings, and multisensor data. It is also compared with common benchmark methods used for matrix completion.


Adapting Machine Learning Algorithms to Novel Use Cases

#artificialintelligence

Measuring the kurtosis (fourth moment) of a statistical distribution may seem like a pedantic exercise for an introductory statistics class, after one learns about the first moment (mean), second moment (variance), and third moment (skewness). However, kurtosis can be used for some unexpected and significant use cases, particularly when employed as a special case of Independent Component Analysis (ICA) for unsupervised learning. ICA is a variant of PCA (Principal Component Analysis) in situations where the data distribution contains subcomponents that are statistically independent of each other, though generally not orthogonal.


Principal Component Analysis Using Structural Similarity Index for Images

arXiv.org Machine Learning

Despite the advances of deep learning in specific tasks using images, the principled assessment of image fidelity and similarity is still a critical ability to develop. As it has been shown that Mean Squared Error (MSE) is insufficient for this task, other measures have been developed with one of the most effective being Structural Similarity Index (SSIM). Such measures can be used for subspace learning but existing methods in machine learning, such as Principal Component Analysis (PCA), are based on Euclidean distance or MSE and thus cannot properly capture the structural features of images. In this paper, we define an image structure subspace which discriminates different types of image distortions. We propose Image Structural Component Analysis (ISCA) and also kernel ISCA by using SSIM, rather than Euclidean distance, in the formulation of PCA. This paper provides a bridge between image quality assessment and manifold learning opening a broad new area for future research.


Improving the Accuracy of Principal Component Analysis by the Maximum Entropy Method

arXiv.org Machine Learning

Classical Principal Component Analysis (PCA) approximates data in terms of projections on a small number of orthogonal vectors. There are simple procedures to efficiently compute various functions of the data from the PCA approximation. The most important function is arguably the Euclidean distance between data items, This can be used, for example, to solve the approximate nearest neighbor problem. We use random variables to model the inherent uncertainty in such approximations, and apply the Maximum Entropy Method to infer the underlying probability distribution. We propose using the expected values of distances between these random variables as improved estimates of the distance. We show by analysis and experimentally that in most cases results obtained by our method are more accurate than what is obtained by the classical approach. This improves the accuracy of a classical technique that have been used with little change for over 100 years.


Exact Recovery of Tensor Robust Principal Component Analysis under Linear Transforms

arXiv.org Machine Learning

This work studies the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is motivated by the recently proposed linear transforms based tensor-tensor product and tensor SVD. We define a new transforms depended tensor rank and the corresponding tensor nuclear norm. Then we solve the TRPCA problem by convex optimization whose objective is a weighted combination of the new tensor nuclear norm and the $\ell_1$-norm. In theory, we show that under certain incoherence conditions, the convex program exactly recovers the underlying low-rank and sparse components. It is of great interest that our new TRPCA model generalizes existing works. In particular, if the studied tensor reduces to a matrix, our TRPCA model reduces to the known matrix RPCA. Our new TRPCA which is allowed to use general linear transforms can be regarded as an extension of our former TRPCA work which uses the discrete Fourier transform. But their proof of the recovery guarantee is different. Numerical experiments verify our results and the application on image recovery demonstrates the superiority of our method.


All Sparse PCA Models Are Wrong, But Some Are Useful. Part I: Computation of Scores, Residuals and Explained Variance

arXiv.org Machine Learning

Sparse Principal Component Analysis (sPCA) is a popular matrix factorization approach based on Principal Component Analysis (PCA) that combines variance maximization and sparsity with the ultimate goal of improving data interpretation. When moving from PCA to sPCA, there are a number of implications that the practitioner needs to be aware of. A relevant one is that scores and loadings in sPCA may not be orthogonal. For this reason, the traditional way of computing scores, residuals and variance explained that is used in the classical PCA cannot directly be applied to sPCA models. This also affects how sPCA components should be visualized. In this paper we illustrate this problem both theoretically and numerically using simulations for several state-of-the-art sPCA algorithms, and provide proper computation of the different elements mentioned. We show that sPCA approaches present disparate and limited performance when modeling noise-free, sparse data. In a follow-up paper, we discuss the theoretical properties that lead to this problem.


Generalized Principal Component Analysis

arXiv.org Machine Learning

Principal component analysis (PCA) [1] is widely used to reduce the dimensionality of large datasets. However, it implicitly optimizes an objective function that is equivalent to a Gaussian likelihood. Hence, for data such as nonnegative, discrete counts that do not follow the normal distribution, PCA may be inappropriate. A motivating example of count data comes from single cell gene expression profiling (scRNA-Seq) where each observation represents a cell and genes are features. Such data are often highly sparse ( 90% zeros) and exhibit skewed distributions poorly matched by Gaussian noise. To remedy this, Collins [2] proposed generalizing PCA to the exponential family in a manner analogous to the generalization of linear regression to generalized linear models. Here, we provide a detailed derivation of generalized PCA (GLM-PCA) with a focus on optimization using Fisher scoring. We also expand on Collins' model by incorporating covariates, and propose post hoc transformations to enhance interpretability of latent factors.


Robust Tensor Completion Using Transformed Tensor SVD

arXiv.org Machine Learning

Tensor (multidimensional arrays) are generalizations of vectors and matrices, which can be used as a powerful tool in modeling multidimensional data such as videos [29], color images [36, 40], hyperspectral images [11, 35, 49], and electroencephalography (EEG) [8]. Based on its multilinear algebraic properties, a tensor can take full advantage of its structures to provide better understanding and higher accuracy of the multidimensional data. In many tensor data applications [9, 20, 27, 33, 37, 40, 41, 47, 50], tensor data sets are often corrupted and/or incomplete owing to various unpredictable or unavoidable situations. It is motivated us to perform tensor completion and tensor robust principal component analysis for multidimensional data processing. Compared with matrix completion and robust principal component analysis, tensor completion and tensor robust principal component analysis are far from being well-studied. The main issues are the definitions of tensor ranks and tensor decompositions.


Cross-product Penalized Component Analysis (XCAN)

arXiv.org Machine Learning

Matrix factorization methods are extensively employed to understand complex data. In this paper, we introduce the cross-product penalized component analysis (XCAN), a sparse matrix factorization based on the optimization of a loss function that allows a trade-off between variance maximization and structural preservation. The approach is based on previous developments, notably (i) the Sparse Principal Component Analysis (SPCA) framework based on the LASSO, (ii) extensions of SPCA to constrain both modes of the factorization, like co-clustering or the Penalized Matrix Decomposition (PMD), and (iii) the Group-wise Principal Component Analysis (GPCA) method. The result is a flexible modeling approach that can be used for data exploration in a large variety of problems. We demonstrate its use with applications from different disciplines.


Principal Component Analysis for Multivariate Extremes

arXiv.org Machine Learning

The first order behavior of multivariate heavy-tailed random vectors above large radial thresholds is ruled by a limit measure in a regular variation framework. For a high dimensional vector, a reasonable assumption is that the support of this measure is concentrated on a lower dimensional subspace, meaning that certain linear combinations of the components are much likelier to be large than others. Identifying this subspace and thus reducing the dimension will facilitate a refined statistical analysis. In this work we apply Principal Component Analysis (PCA) to a re-scaled version of radially thresholded observations. Within the statistical learning framework of empirical risk minimization, our main focus is to analyze the squared reconstruction error for the exceedances over large radial thresholds. We prove that the empirical risk converges to the true risk, uniformly over all projection subspaces. As a consequence, the best projection subspace is shown to converge in probability to the optimal one, in terms of the Hausdorff distance between their intersections with the unit sphere. In addition, if the exceedances are re-scaled to the unit ball, we obtain finite sample uniform guarantees to the reconstruction error pertaining to the estimated projection sub-space. Numerical experiments illustrate the relevance of the proposed framework for practical purposes.