I recently ran a data science training course on the topic of principal component analysis and dimension reduction. This course was less about the intimate mathematical details, but rather on understanding the various outputs that are available when running PCA. In other words, my goal was to make sure that followers of this tutorial can see what terms like "explained_variance_" and "explained_variance_ratio_" and "components_" mean when they probe the PCA object. It shouldn't be a mystery and it should be something that anyone can recreate "by hand". My training sessions tend to be fluid and no one session is the same as any other.
Using the linear Gaussian latent variable model as a starting point we relax some of the constraints it imposes by deriving a nonparametric latent feature Gaussian variable model. This model introduces additional discrete latent variables to the original structure. The Bayesian nonparametric nature of this new model allows it to adapt complexity as more data is observed and project each data point onto a varying number of subspaces. The linear relationship between the continuous latent and observed variables make the proposed model straightforward to interpret, resembling a locally adaptive probabilistic PCA (A-PPCA). We propose two alternative Gibbs sampling procedures for inference in the new model and demonstrate its applicability on sensor data for passive health monitoring.
Analyzing large data sets comes with multiple challenges. One of the challenges is to get data in the right structure for the analysis. Without preprocessing the data, your algorithms might have difficult time converging and/or take a long time execute. One of the techniques that we used at TCinc is Principal Component Analysis (PCA). The official definition of PCA from Wikipediai is "Principal component analysis (PCA) is a statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components."
In many real-world applications such as text categorization and face recognition, the dimensions of data are usually very high. Dealing with high-dimensional data is computationally expensive while noise or outliers in the data can increase dramatically as the dimension increases. Dimension reduction is one of the most important and effective methods to handle high dimensional data [4, 17, 20]. Among the dimension reduction methods, Principal Component Analysis (PCA) is one of the most widely used methods due to its simplicity and effectiveness. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of linearly uncorrelated principal directions. Usually the number of principal directions is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal direction has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding direction has the highest variance under the constraint that it is orthogonal to the preceding directions. The resulting vectors are an uncorrelated orthogonal basis set. When data points lie in a low-dimensional manifold and the manifold is linear or nearly-linear, the low-dimensional structure of data can be effectively captured by a linear subspace spanned by the principal PCA directions.
Multivariate binary data is becoming abundant in current biological research. Logistic principal component analysis (PCA) is one of the commonly used tools to explore the relationships inside a multivariate binary data set by exploiting the underlying low rank structure. We re-expressed the logistic PCA model based on the latent variable interpretation of the generalized linear model on binary data. The multivariate binary data set is assumed to be the sign observation of an unobserved quantitative data set, on which a low rank structure is assumed to exist. However, the standard logistic PCA model (using exact low rank constraint) is prone to overfitting, which could lead to divergence of some estimated parameters towards infinity. We propose to fit a logistic PCA model through non-convex singular value thresholding to alleviate the overfitting issue. An efficient Majorization-Minimization algorithm is implemented to fit the model and a missing value based cross validation (CV) procedure is introduced for the model selection. Our experiments on realistic simulations of imbalanced binary data and low signal to noise ratio show that the CV error based model selection procedure is successful in selecting the proposed model. Furthermore, the selected model demonstrates superior performance in recovering the underlying low rank structure compared to models with convex nuclear norm penalty and exact low rank constraint. A binary copy number aberration data set is used to illustrate the proposed methodology in practice.
This paper describes some applications of an incremental implementation of the principal component analysis (PCA). The algorithm updates the transformation coefficients matrix on-line for each new sample, without the need to keep all the samples in memory. The algorithm is formally equivalent to the usual batch version, in the sense that given a sample set the transformation coefficients at the end of the process are the same. The implications of applying the PCA in real time are discussed with the help of data analysis examples. In particular we focus on the problem of the continuity of the PCs during an on-line analysis.
We propose algorithms for online principal component analysis (PCA) and variance minimization for adaptive settings. Previous literature has focused on upper bounding the static adversarial regret, whose comparator is the optimal fixed action in hindsight. However, static regret is not an appropriate metric when the underlying environment is changing. Instead, we adopt the adaptive regret metric from the previous literature and propose online adaptive algorithms for PCA and variance minimization, that have sub-linear adaptive regret guarantees. We demonstrate both theoretically and experimentally that the proposed algorithms can adapt to the changing environments.
Big Data is increasingly becoming the norm and affecting many domains. When there's lots of data involving multiple variables, the work of a data scientist gets difficult. Algorithms will also take longer to complete. Wouldn't it be sensible to identify and consider only those variables that influence the most and discard others? This in turn leads to compression since the less important information are discarded.
Principal Component Analysis is a novel way of of dimensionality reduction. This problem essentially boils down to finding the top k eigen vectors of the data covariance matrix. A considerable amount of literature is found on algorithms meant to do so such as an online method be Warmuth and Kuzmin, Matrix Stochastic Gradient by Arora, Oja's method and many others. In this paper we see some of these stochastic approaches to the PCA optimization problem and comment on their convergence and runtime to obtain an epsilon sub-optimal solution. We revisit convex relaxation based methods for stochastic optimization of principal component analysis. While methods that directly solve the non convex problem have been shown to be optimal in terms of statistical and computational efficiency, the methods based on convex relaxation have been shown to enjoy comparable, or even superior, empirical performance. This motivates the need for a deeper formal understanding of the latter.
PCA (principal component analysis) and its variants are ubiquitous techniques for matrix dimension reduction and reduced-dimension latent-factor extraction. For an arbitrary matrix, they cannot, on their own, determine the size of the reduced dimension, but rather must be given this as an input. NML (normalized maximum likelihood) is a universal implementation of the Minimal Description Length principle, which gives an objective compression-based criterion for model selection. This work applies NML to PCA. A direct attempt to do so would involve the distributions of singular values of random matrices, which is difficult. A reduction to linear regression with a noisy unitary covariate matrix, however, allows finding closed-form bounds on the NML of PCA.