Principal Component Analysis
Principal Component Analysis
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique that is used in numerous applications, such as stock market predictions, the analysis of gene expression data, and many more. In this tutorial, we will see that PCA is not just a "black box", and we are going to unravel its internals in 3 basic steps. The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of a PCA analysis is to identify patterns in data; PCA aims to detect the correlation between variables. If a strong correlation between variables exists, the attempt to reduce the dimensionality only makes sense.
Dynamic Principal Component Analysis: Identifying the Relationship between Multiple Air Pollutants
Melnikov, Oleg, Raun, Loren H., Ensor, Katherine B.
The dynamic nature of air quality chemistry and transport makes it difficult to identify the mixture of air pollutants for a region. In this study of air quality in the Houston metropolitan area we apply dynamic principal component analysis (DPCA) to a normalized multivariate time series of daily concentration measurements of five pollutants (O3, CO, NO2, SO2, PM2.5) from January 1, 2009 through December 31, 2011 for each of the 24 hours in a day. The resulting dynamic components are examined by hour across days for the 3 year period. Diurnal and seasonal patterns are revealed underlining times when DPCA performs best and two principal components (PCs) explain most variability in the multivariate series. DPCA is shown to be superior to static principal component analysis (PCA) in discovery of linear relations among transformed pollutant measurements. DPCA captures the time-dependent correlation structure of the underlying pollutants recorded at up to 34 monitoring sites in the region. In winter mornings the first principal component (PC1) (mainly CO and NO2) explains up to 70% of variability. Augmenting with the second principal component (PC2) (mainly driven by SO2) the explained variability rises to 90%. In the afternoon, O3 gains prominence in the second principal component. The seasonal profile of PCs' contribution to variance loses its distinction in the afternoon, yet cumulatively PC1 and PC2 still explain up to 65% of variability in ambient air data. DPCA provides a strategy for identifying the changing air quality profile for the region studied.
Asymptotic properties of Principal Component Analysis and shrinkage-bias adjustment under the Generalized Spiked Population model
With the development of high-throughput technologies, principal component analysis (PCA) in the high-dimensional regime is of great interest. Most of the existing theoretical and methodological results for high-dimensional PCA are based on the spiked population model in which all the population eigenvalues are equal except for a few large ones. Due to the presence of local correlation among features, however, this assumption may not be satisfied in many real-world datasets. To address this issue, we investigated the asymptotic behaviors of PCA under the generalized spiked population model. Based on the theoretical results, we proposed a series of methods for the consistent estimation of population eigenvalues, angles between the sample and population eigenvectors, correlation coefficients between the sample and population principal component (PC) scores, and the shrinkage bias adjustment for the predicted PC scores. Using numerical experiments and real data examples from the genetics literature, we showed that our methods can greatly reduce bias and improve prediction accuracy.
A Novel Approach for Phase Identification in Smart Grids Using Graph Theory and Principal Component Analysis
Jayadev, P Satya, Rajeswaran, Aravind, Bhatt, Nirav P, Pasumarthy, Ramkrishna
Consumers with low demand, like households, are generally supplied single-phase power by connecting their service mains to one of the phases of a distribution transformer. The distribution companies face the problem of keeping a record of consumer connectivity to a phase due to uninformed changes that happen. The exact phase connectivity information is important for the efficient operation and control of distribution system. We propose a new data driven approach to the problem based on Principal Component Analysis (PCA) and its Graph Theoretic interpretations, using energy measurements in equally timed short intervals, generated from smart meters. We propose an algorithm for inferring phase connectivity from noisy measurements. The algorithm is demonstrated using simulated data for phase connectivities in distribution networks.
Suppressing Background Radiation Using Poisson Principal Component Analysis
Tandon, P., Huggins, P., Dubrawski, A., Labov, S., Nelson, K.
Performance of nuclear threat detection systems based on gamma-ray spectrometry often strongly depends on the ability to identify the part of measured signal that can be attributed to background radiation. We have successfully applied a method based on Principal Component Analysis (PCA) to obtain a compact null-space model of background spectra using PCA projection residuals to derive a source detection score. We have shown the method's utility in a threat detection system using mobile spectrometers in urban scenes (Tandon et al 2012). While it is commonly assumed that measured photon counts follow a Poisson process, standard PCA makes a Gaussian assumption about the data distribution, which may be a poor approximation when photon counts are low. This paper studies whether and in what conditions PCA with a Poisson-based loss function (Poisson PCA) can outperform standard Gaussian PCA in modeling background radiation to enable more sensitive and specific nuclear threat detection.
Wishart Mechanism for Differentially Private Principal Components Analysis
Jiang, Wuxuan (Shanghai Jiao Tong University) | Xie, Cong (Shanghai Jiao Tong University) | Zhang, Zhihua (Shanghai Jiao Tong University)
We propose a new input perturbation mechanism for publishing a covariance matrix to achieve (epsilon,0)-differential privacy. Our mechanism uses a Wishart distribution to generate matrix noise. In particular, we apply this mechanism to principal component analysis (PCA). Our mechanism is able to keep the positive semi-definiteness of the published covariance matrix. Thus, our approach gives rise to a general publishing framework for input perturbation of a symmetric positive semidefinite matrix. Moreover, compared with the classic Laplace mechanism, our method has better utility guarantee. To the best of our knowledge, the Wishart mechanism is the best input perturbation approach for (epsilon,0)-differentially private PCA. We also compare our work with previous exponential mechanism algorithms in the literature and provide near optimal bound while having more flexibility and less computational intractability.
Principal Component Analysis using R
Technically speaking, PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component.
Phase transitions and sample complexity in Bayes-optimal matrix factorization
Kabashima, Yoshiyuki, Krzakala, Florent, Mézard, Marc, Sakata, Ayaka, Zdeborová, Lenka
We analyse the matrix factorization problem. Given a noisy measurement of a product of two matrices, the problem is to estimate back the original matrices. It arises in many applications such as dictionary learning, blind matrix calibration, sparse principal component analysis, blind source separation, low rank matrix completion, robust principal component analysis or factor analysis. It is also important in machine learning: unsupervised representation learning can often be studied through matrix factorization. We use the tools of statistical mechanics - the cavity and replica methods - to analyze the achievability and computational tractability of the inference problems in the setting of Bayes-optimal inference, which amounts to assuming that the two matrices have random independent elements generated from some known distribution, and this information is available to the inference algorithm. In this setting, we compute the minimal mean-squared-error achievable in principle in any computational time, and the error that can be achieved by an efficient approximate message passing algorithm. The computation is based on the asymptotic state-evolution analysis of the algorithm. The performance that our analysis predicts, both in terms of the achieved mean-squared-error, and in terms of sample complexity, is extremely promising and motivating for a further development of the algorithm.
Principal Component Projection Without Principal Component Analysis
Frostig, Roy, Musco, Cameron, Musco, Christopher, Sidford, Aaron
We show how to efficiently project a vector onto the top principal components of a matrix, without explicitly computing these components. Specifically, we introduce an iterative algorithm that provably computes the projection using few calls to any black-box routine for ridge regression. By avoiding explicit principal component analysis (PCA), our algorithm is the first with no runtime dependence on the number of top principal components. We show that it can be used to give a fast iterative method for the popular principal component regression problem, giving the first major runtime improvement over the naive method of combining PCA with regression. To achieve our results, we first observe that ridge regression can be used to obtain a "smooth projection" onto the top principal components. We then sharpen this approximation to true projection using a low-degree polynomial approximation to the matrix step function. Step function approximation is a topic of long-term interest in scientific computing. We extend prior theory by constructing polynomials with simple iterative structure and rigorously analyzing their behavior under limited precision.
Modularity Component Analysis versus Principal Component Analysis
In this paper the exact linear relation between the leading eigenvectors of the modularity matrix and the singular vectors of an uncentered data matrix is developed. Based on this analysis the concept of a modularity component is defined, and its properties are developed. It is shown that modularity component analysis can be used to cluster data similar to how traditional principal component analysis is used except that modularity component analysis does not require data centering.