Principal Component Analysis
Principal Component Analysis for Machine Learning - Translucent
Analyzing large data sets comes with multiple challenges. One of them is getting the data into the right structure for the analysis. Without preprocessing, your algorithms might have a difficult time converging and/or take a long time to execute. One of the techniques we use at TCinc is Principal Component Analysis (PCA). Wikipedia defines PCA as "a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components."
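As a quick illustration of that definition, here is a minimal numpy sketch (synthetic data, not from the original post) that centers a data set, diagonalizes its covariance matrix, and projects onto the top principal components; the resulting scores are linearly uncorrelated, which is what makes PCA useful as a preprocessing step:

```python
import numpy as np

# Toy data: 200 samples, 5 correlated features (synthetic, for illustration only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(200, 5))

# Center the data, then diagonalize the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]                 # re-sort descending
components = eigvecs[:, order[:2]]                # keep the top 2 principal components

# Project: the resulting scores are linearly uncorrelated.
scores = Xc @ components
print(np.round(np.cov(scores, rowvar=False), 4))  # ~diagonal covariance
```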
Uncertainty-Aware Principal Component Analysis
Görtler, Jochen, Spinner, Thilo, Streeb, Dirk, Weiskopf, Daniel, Deussen, Oliver
We present a technique to perform dimensionality reduction on data that is subject to uncertainty. Our method is a generalization of traditional principal component analysis (PCA) to multivariate probability distributions. In comparison to non-linear methods, linear dimensionality reduction techniques have the advantage that the characteristics of such probability distributions remain intact after projection. We derive a representation of the covariance matrix that respects potential uncertainty in each of the observations, which forms the mathematical foundation of our new method, uncertainty-aware PCA. In addition to the accuracy and performance gained by our approach over sampling-based strategies, our formulation allows us to perform sensitivity analysis with regard to the uncertainty in the data. For this, we propose factor traces as a novel visualization that enables us to better understand the influence of uncertainty on the chosen principal components. We provide multiple examples of our technique using real-world datasets, and show how to propagate multivariate normal distributions through PCA in closed form. Furthermore, we discuss extensions and limitations of our approach.
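The paper's uncertainty-aware covariance construction is not reproduced here, but the closed-form propagation of a multivariate normal through a linear PCA projection rests on a textbook identity: if x ~ N(mu, Sigma) and y = Wᵀ(x − m), then y ~ N(Wᵀ(mu − m), WᵀSigmaW). A minimal numpy sketch of just that identity:

```python
import numpy as np

# Textbook Gaussian identity (not the authors' full uncertainty-aware method):
# for x ~ N(mu, Sigma) and y = W.T @ (x - m),
# y ~ N(W.T @ (mu - m), W.T @ Sigma @ W) -- exact, no sampling needed.
rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(3, 3))
Sigma = A @ A.T                       # a valid (PSD) input covariance

W = np.linalg.eigh(Sigma)[1][:, -2:]  # top-2 eigenvectors as a stand-in projection
m = mu                                # center at the mean for simplicity

mu_proj = W.T @ (mu - m)              # projected mean (zero here)
Sigma_proj = W.T @ Sigma @ W          # projected covariance, in closed form
print(mu_proj, np.round(Sigma_proj, 4))
```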
Self-Paced Probabilistic Principal Component Analysis for Data with Outliers
Zhao, Bowen, Xiao, Xi, Zhang, Wanpeng, Zhang, Bin, Xia, Shutao
Principal Component Analysis (PCA) is a popular tool for dimensionality reduction and feature extraction in data analysis. There is a probabilistic version of PCA, known as Probabilistic PCA (PPCA). However, standard PCA and PPCA are not robust, as they are sensitive to outliers. To alleviate this problem, this paper introduces the Self-Paced Learning mechanism into PPCA and proposes a novel method called Self-Paced Probabilistic Principal Component Analysis (SP-PPCA). Furthermore, we design the corresponding optimization algorithm based on an alternating search strategy and the expectation-maximization algorithm. SP-PPCA iteratively searches for optimal projection vectors and filters out outliers. Experiments on both synthetic problems and real-world datasets clearly demonstrate that SP-PPCA is able to reduce or eliminate the impact of outliers.
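SP-PPCA itself alternates an EM step for PPCA with a self-paced sample-selection step; the sketch below is a deliberately simplified caricature of the self-paced idea, using plain PCA reconstruction error rather than the authors' probabilistic algorithm:

```python
import numpy as np

# Simplified caricature of self-paced outlier filtering (not the authors' SP-PPCA):
# alternately fit PCA on the currently selected samples and re-select the "easy"
# samples whose reconstruction error falls below a gradually loosened threshold.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 10))
X[:15] += 20 * rng.normal(size=(15, 10))         # inject 15 gross outliers

selected = np.ones(len(X), dtype=bool)
for lam in [1.0, 2.0, 4.0]:                      # self-paced: threshold grows each round
    Xs = X[selected]
    mean = Xs.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xs - mean, full_matrices=False)
    P = Vt[:5].T                                 # rank-5 projector from current inliers
    resid = X - mean
    err = ((resid - resid @ P @ P.T) ** 2).sum(axis=1)
    selected = err < lam * np.median(err)        # keep currently "easy" samples
print(selected[:15].sum(), "of 15 outliers still selected")
```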
Eigenvalue and Generalized Eigenvalue Problems: Tutorial
Ghojogh, Benyamin, Karray, Fakhri, Crowley, Mark
This paper is a tutorial for eigenvalue and generalized eigenvalue problems. We first introduce the eigenvalue problem, eigen-decomposition (spectral decomposition), and the generalized eigenvalue problem. Then we describe the optimization problems that give rise to eigenvalue and generalized eigenvalue problems. We also provide examples from machine learning, including principal component analysis, kernel supervised principal component analysis, and Fisher discriminant analysis, which result in eigenvalue and generalized eigenvalue problems. Finally, we introduce the solutions to both eigenvalue and generalized eigenvalue problems.
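Both problems the tutorial covers are directly available in numpy/scipy; a minimal sketch of the ordinary problem Av = λv and the generalized problem Av = λBv for symmetric matrices:

```python
import numpy as np
from scipy.linalg import eigh

# Symmetric eigenvalue problem A v = lambda v, and the generalized form
# A v = lambda B v that appears in e.g. Fisher discriminant analysis.
rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
A = M @ M.T                      # symmetric positive semidefinite
N = rng.normal(size=(4, 4))
B = N @ N.T + 4 * np.eye(4)      # symmetric positive definite

w, V = eigh(A)                   # ordinary problem: A @ V = V * w
print(np.allclose(A @ V, V * w))

w_g, V_g = eigh(A, B)            # generalized problem: A @ V_g = (B @ V_g) * w_g
print(np.allclose(A @ V_g, B @ V_g * w_g))
```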
PCA and SVD explained with numpy
How exactly are principal component analysis and singular value decomposition related, and how can both be implemented using numpy? Principal component analysis (PCA) and singular value decomposition (SVD) are commonly used dimensionality reduction approaches in exploratory data analysis (EDA) and Machine Learning. They are both classical linear dimensionality reduction methods that attempt to find linear combinations of features in the original high-dimensional data matrix in order to construct a meaningful representation of the dataset. Different fields prefer different methods when it comes to reducing dimensionality: PCA is often used by biologists to analyze and visualize the sources of variance in datasets from population genetics, transcriptomics, proteomics, and microbiome studies. Meanwhile, SVD, particularly its reduced version, truncated SVD, is more popular in natural language processing, where it is used to obtain representations of gigantic yet sparse word-frequency matrices.
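The core relationship the post refers to can be verified in a few lines: for a centered data matrix X = USVᵀ with n rows, the principal axes are the rows of Vᵀ and the covariance eigenvalues equal S²/(n−1). A minimal numpy sketch on synthetic data:

```python
import numpy as np

# The PCA-SVD connection: for centered data X (n samples x p features) with
# X = U S Vt, the principal axes are the rows of Vt and the covariance
# eigenvalues are S**2 / (n - 1).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix (classic PCA).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals = eigvals[::-1]                       # descending order

# Route 2: SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(eigvals, S**2 / (n - 1)))   # identical spectra
# Principal component scores either way (columns may differ only by sign):
scores_svd = U * S                            # == Xc @ Vt.T
print(np.allclose(np.abs(scores_svd), np.abs(Xc @ eigvecs[:, ::-1])))
```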
Spherical Principal Component Analysis
Liu, Kai, Li, Qiuwei, Wang, Hua, Tang, Gongguo
In many real-world applications such as text categorization and face recognition, the dimensions of data are usually very high. Dealing with high-dimensional data is computationally expensive, and noise or outliers in the data can increase dramatically as the dimension increases. Dimension reduction is one of the most important and effective methods for handling high-dimensional data [4, 17, 20]. Among dimension reduction methods, Principal Component Analysis (PCA) is one of the most widely used due to its simplicity and effectiveness. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated principal directions. Usually the number of principal directions is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal direction has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding direction has the highest variance possible under the constraint that it is orthogonal to the preceding directions. The resulting vectors form an uncorrelated orthogonal basis set. When data points lie in a low-dimensional manifold that is linear or nearly linear, the low-dimensional structure of the data can be effectively captured by the linear subspace spanned by the principal directions found by PCA.
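The variance-maximization property described above is easy to check numerically: the first principal direction beats any random unit direction in projected variance, and the directions form an orthonormal basis. A small numpy sketch:

```python
import numpy as np

# Checking the defining property of PCA: the first principal direction maximizes
# the variance of the projected data over unit vectors, and the directions form
# an orthonormal basis.
rng = np.random.default_rng(5)
X = (rng.normal(size=(500, 2)) @ rng.normal(size=(2, 8))
     + 0.1 * rng.normal(size=(500, 8)))        # near-linear low-dimensional manifold
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

v1 = Vt[0]                                     # first principal direction
var_v1 = (Xc @ v1).var()
for _ in range(1000):                          # compare against random unit directions
    u = rng.normal(size=8)
    u /= np.linalg.norm(u)
    assert (Xc @ u).var() <= var_v1 + 1e-9
print(np.allclose(Vt @ Vt.T, np.eye(8)))       # orthonormal basis set
```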
Functional Principal Component Analysis for Extrapolating Multi-stream Longitudinal Data
The advance of modern sensor technologies enables collection of multi-stream longitudinal data where multiple signals from different units are collected in real-time. In this article, we present a non-parametric approach to predict the evolution of multi-stream longitudinal data for an in-service unit through borrowing strength from other historical units. Our approach first decomposes each stream into a linear combination of eigenfunctions and their corresponding functional principal component (FPC) scores. A Gaussian process prior for the FPC scores is then established based on a functional semi-metric that measures similarities between streams of historical units and the in-service unit. Finally, an empirical Bayesian updating strategy is derived to update the established prior using real-time stream data obtained from the in-service unit. Experiments on synthetic and real world data show that the proposed framework outperforms state-of-the-art approaches and can effectively account for heterogeneity as well as achieve high predictive accuracy.
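A bare-bones discretized version of the first step, decomposing streams into a mean curve, empirical eigenfunctions, and per-stream FPC scores, can be sketched as follows (synthetic streams on a common grid; the Gaussian-process prior and the Bayesian updating are not reproduced):

```python
import numpy as np

# Discretized functional PCA: mean curve + eigenfunctions + FPC scores.
rng = np.random.default_rng(6)
t = np.linspace(0, 1, 50)
# 40 synthetic historical streams sampled on a common grid.
Y = (np.outer(rng.normal(size=40), np.sin(2 * np.pi * t))
     + np.outer(rng.normal(size=40), t)
     + 0.05 * rng.normal(size=(40, 50)))

mean_curve = Y.mean(axis=0)
Yc = Y - mean_curve
C = Yc.T @ Yc / len(Y)                          # empirical covariance on the grid
vals, vecs = np.linalg.eigh(C)
phi = vecs[:, ::-1][:, :2]                      # top-2 eigenfunctions (discretized)
scores = Yc @ phi                               # FPC scores for each stream

# Each stream is reconstructed from its scores: mean curve + linear combination
# of eigenfunctions, which is the decomposition the abstract describes.
Y_hat = mean_curve + scores @ phi.T
print(np.abs(Y - Y_hat).max())                  # small residual with 2 components
```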
Logistic principal component analysis via non-convex singular value thresholding
Song, Yipeng, Westerhuis, Johan A., Smilde, Age K.
Multivariate binary data is becoming abundant in current biological research. Logistic principal component analysis (PCA) is one of the commonly used tools to explore the relationships inside a multivariate binary data set by exploiting the underlying low rank structure. We re-expressed the logistic PCA model based on the latent variable interpretation of the generalized linear model on binary data. The multivariate binary data set is assumed to be the sign observation of an unobserved quantitative data set, on which a low rank structure is assumed to exist. However, the standard logistic PCA model (using an exact low rank constraint) is prone to overfitting, which can lead to divergence of some estimated parameters towards infinity. We propose to fit the logistic PCA model through non-convex singular value thresholding to alleviate the overfitting issue. An efficient Majorization-Minimization algorithm is implemented to fit the model, and a missing-value-based cross-validation (CV) procedure is introduced for model selection. Our experiments on realistic simulations of imbalanced binary data with low signal-to-noise ratios show that the CV-error-based model selection procedure successfully selects the proposed model. Furthermore, the selected model demonstrates superior performance in recovering the underlying low rank structure compared to models with a convex nuclear norm penalty or an exact low rank constraint. A binary copy number aberration data set is used to illustrate the proposed methodology in practice.
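Singular value thresholding itself is simple to sketch: take the SVD, shrink the spectrum, and rebuild the matrix. The soft-thresholding rule below corresponds to the convex nuclear-norm penalty the paper compares against, with hard thresholding shown as one simple non-convex alternative; the paper's specific non-convex penalty is not reproduced here:

```python
import numpy as np

def svt(Z, tau, hard=False):
    """Singular value thresholding: shrink the spectrum of Z and rebuild."""
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    if hard:
        S_shrunk = np.where(S > tau, S, 0.0)     # hard thresholding (non-convex)
    else:
        S_shrunk = np.maximum(S - tau, 0.0)      # soft thresholding (nuclear norm)
    return (U * S_shrunk) @ Vt

# Low-rank signal plus noise: thresholding suppresses the noise ranks.
rng = np.random.default_rng(7)
Z = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(30, 20))
print(np.linalg.matrix_rank(svt(Z, tau=2.0)))    # recovers the rank-3 structure
```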
The excluded area of two-dimensional hard particles
Geigenfeind, Thomas, Heras, Daniel de las
The excluded area between a pair of two-dimensional hard particles with given relative orientation is the region in which one particle cannot be located due to the presence of the other particle. The magnitude of the excluded area as a function of the relative particle orientation plays a major role in the determination of the bulk phase behaviour of hard particles. We use principal component analysis to identify the different types of excluded area corresponding to randomly generated two-dimensional hard particles modeled as non-self-intersecting polygons and star lines (line segments radiating from a common origin). Only three principal components are required to obtain an excellent representation of the value of the excluded area as a function of the relative particle orientation. Independently of the particle shape, the minimum value of the excluded area is always achieved when the particles are antiparallel to each other. The property that affects the value of the excluded area most strongly is the elongation of the particle shape. Principal component analysis identifies four limiting cases of excluded areas with one to four global minima at equispaced relative orientations. We study selected particle shapes using Monte Carlo simulations.
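Schematically, the PCA step amounts to sampling each excluded-area curve on a common grid of relative orientations, stacking the curves into a matrix, and keeping three components. The sketch below uses synthetic periodic curves in place of the real polygon and star-line calculations:

```python
import numpy as np

# Synthetic stand-ins for "excluded area vs. relative orientation" curves,
# with one to four minima over the full rotation (not the paper's real data).
rng = np.random.default_rng(8)
theta = np.linspace(0, 2 * np.pi, 180, endpoint=False)
curves = np.stack([1 + 0.5 * rng.random() * np.cos(k * theta)
                   + 0.05 * rng.normal(size=180)
                   for k in rng.integers(1, 5, size=200)])

mean_curve = curves.mean(axis=0)
U, S, Vt = np.linalg.svd(curves - mean_curve, full_matrices=False)
pcs = Vt[:3]                                     # three principal components
coeffs = (curves - mean_curve) @ pcs.T           # 3 coefficients per particle shape
recon = mean_curve + coeffs @ pcs
frac = 1 - ((curves - recon) ** 2).sum() / ((curves - mean_curve) ** 2).sum()
print(f"variance captured by 3 PCs: {frac:.3f}")
```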
Bandit Principal Component Analysis
Kotłowski, Wojciech, Neu, Gergely
We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of $d$-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in $O(d)$ time per trial, which massively improves over every previously known online PCA method. We complement these results with a lower bound of $\Omega(d\sqrt{T})$.