Principal Component Analysis
Principal Component Analysis (PCA) with Python Examples -- Tutorial
When implementing machine learning algorithms, adding more features can actually hurt performance. Increasing the number of features does not always improve classification accuracy, a phenomenon known as the curse of dimensionality. Dimensionality reduction therefore aims to improve classification accuracy by selecting an optimal set of lower-dimensional features. Principal component analysis (PCA) is essential for data science, machine learning, data visualization, statistics, and other quantitative fields. A working knowledge of vectors, matrices, the matrix transpose, eigenvalues, and eigenvectors is needed to understand how dimensionality reduction works.
Dimensionality Reduction: Principal Component Analysis
A dataset is made up of a number of features. As long as these features are related in some way to the target and are reasonable in number, a machine learning model can produce decent results after learning from the data. But if the number of features is high and most of them do not contribute to the model's learning, the performance of the model goes down and the time taken to produce predictions increases. Transforming the original feature space into a lower-dimensional subspace is one way of performing dimensionality reduction, and Principal Component Analysis (PCA) does exactly that. So let's take a look at the concepts behind PCA.
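As a concrete illustration of this idea, here is a minimal sketch (assuming NumPy and scikit-learn are installed; the synthetic data and variable names are purely illustrative) that projects a 10-dimensional dataset onto its first two principal components:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 samples, 10 correlated features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Standardize, then project onto the top-2 principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                   # (200, 2)
print(pca.explained_variance_ratio_)     # fraction of variance kept by each component

Because most of the variance here lives in a low-dimensional latent subspace, the two retained components already capture the bulk of it, which is exactly the kind of compression PCA is after.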
FAST-PCA: A Fast and Exact Algorithm for Distributed Principal Component Analysis
Gang, Arpita, Bajwa, Waheed U.
Principal Component Analysis (PCA) is a fundamental data preprocessing tool in the world of machine learning. While PCA is often thought of as a dimensionality reduction method, the purpose of PCA is actually two-fold: dimension reduction and uncorrelated feature learning. Furthermore, the enormity of the dimensions and sample sizes of modern datasets has rendered centralized PCA solutions unusable. In that vein, this paper reconsiders the problem of PCA when data samples are distributed across nodes in an arbitrarily connected network. While a few solutions for distributed PCA exist, they either overlook the uncorrelated feature learning aspect of PCA, tend to have high communication overhead that makes them inefficient, and/or lack `exact' or `global' convergence guarantees. To overcome these issues, this paper proposes a distributed PCA algorithm termed FAST-PCA (Fast and exAct diSTributed PCA). The proposed algorithm is efficient in terms of communication and is proven to converge linearly and exactly to the principal components, leading to dimension reduction as well as uncorrelated features. The claims are further supported by experimental results.
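FAST-PCA itself is not reproduced here. Purely as a rough illustration of the distributed setting, the toy sketch below (node count, data, and variable names are all assumptions) runs power iteration for the leading principal component: each simulated node works only on its local data block, and the nodes' local covariance-vector products are averaged, a stand-in for the consensus/communication step a real distributed algorithm would perform over the network:

import numpy as np

rng = np.random.default_rng(1)
d, n_nodes, n_local = 20, 4, 100

# Each simulated node holds its own local data block (illustrative only)
mixing = rng.normal(size=(d, d))
blocks = [rng.normal(size=(n_local, d)) @ mixing for _ in range(n_nodes)]
blocks = [B - B.mean(axis=0) for B in blocks]    # center locally

# Distributed power iteration: local covariance-vector products, then an exact average
v = rng.normal(size=d)
v /= np.linalg.norm(v)
for _ in range(200):
    local = [B.T @ (B @ v) / B.shape[0] for B in blocks]
    v = np.mean(local, axis=0)
    v /= np.linalg.norm(v)

# Sanity check against a centralized eigendecomposition of the pooled covariance
X = np.vstack(blocks)
w, V = np.linalg.eigh(X.T @ X / X.shape[0])
print(abs(v @ V[:, -1]))                         # close to 1: same leading direction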
A Coupled CP Decomposition for Principal Components Analysis of Symmetric Networks
Weylandt, Michael, Michailidis, George
In a number of application domains, one observes a sequence of network data: for example, repeated measurements of user interactions on social media platforms, financial correlation networks over time, or networks across subjects, as in multi-subject studies of brain connectivity. One way to analyze such data is by stacking the networks into a third-order array, or tensor. We propose a principal components analysis (PCA) framework for network sequence data, based on a novel decomposition for semi-symmetric tensors. We derive efficient algorithms for computing our proposed "Coupled CP" decomposition and establish estimation consistency of our approach under an analogue of the spiked covariance model, with rates matching the matrix case up to a logarithmic term. Our framework inherits many of the strengths of classical PCA and is suitable for a wide range of unsupervised learning tasks, including identifying principal networks, isolating meaningful changepoints or outliers across observations, and characterizing the "variability network" of the most varying edges. Finally, we demonstrate the effectiveness of our proposal on simulated data and on examples from political science and financial economics. The proof techniques used to establish our main consistency results are surprisingly straightforward and may find use in a variety of other matrix and tensor decomposition problems.
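The coupled CP decomposition itself is not shown here. As a much simpler point of reference (all data and names below are illustrative), one can stack symmetric adjacency matrices into a third-order array, vectorize the upper triangle of each slice, and apply ordinary PCA, which yields "principal networks" in the loose sense described above:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
n_nodes, n_obs = 15, 60

# A sequence of symmetric networks (e.g. correlation networks over time)
nets = [(A + A.T) / 2 for A in rng.normal(size=(n_obs, n_nodes, n_nodes))]
tensor = np.stack(nets)                  # third-order array, shape (60, 15, 15)

# Vectorize the upper triangle of each slice and run ordinary PCA on the result
iu = np.triu_indices(n_nodes, k=1)
X = np.stack([net[iu] for net in nets])  # shape (n_obs, n_edges)
pca = PCA(n_components=3).fit(X)

# Fold the first loading vector back into a symmetric "principal network"
principal_net = np.zeros((n_nodes, n_nodes))
principal_net[iu] = pca.components_[0]
principal_net += principal_net.T
print(principal_net.shape)               # (15, 15)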
Summerville
Procedural Content Generation (PCG) using machine learning is a fast-growing area of research. Action Role Playing Game (ARPG) levels represent an interesting challenge for PCG due to their multi-tiered structure and nonlinearity. Previous work has used Bayes Nets (BN) to learn properties of the topological structure of levels from The Legend of Zelda. In this paper we describe a method for sampling these learned distributions to generate valid, playable level topologies. We take this a level deeper and learn a sampleable representation of the individual rooms using Principal Component Analysis. We combine the two techniques and present a multi-scale machine-learned technique for procedurally generating ARPG levels from a corpus of levels from The Legend of Zelda.
Distributionally Robust Fair Principal Components via Geodesic Descents
Vu, Hieu, Tran, Toan, Yue, Man-Chung, Nguyen, Viet Anh
Principal component analysis is a simple yet useful dimensionality reduction technique in modern machine learning pipelines. In consequential domains such as college admission, healthcare and credit approval, it is imperative to take into account emerging criteria such as the fairness and the robustness of the learned projection. In this paper, we propose a distributionally robust optimization problem for principal component analysis which internalizes a fairness criterion in the objective function. The learned projection thus balances the trade-off between the total reconstruction error and the reconstruction error gap between subgroups, taken in the min-max sense over all distributions in a moment-based ambiguity set. The resulting optimization problem over the Stiefel manifold can be efficiently solved by a Riemannian subgradient descent algorithm with a sub-linear convergence rate. Our experimental results on real-world datasets show the merits of our proposed method over state-of-the-art baselines. Machine learning models are ubiquitous in our daily lives and support the decision-making process in diverse domains.
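The fairness-aware min-max objective of the paper is not reproduced here. The sketch below only illustrates the kind of optimization machinery it relies on: a retraction-based gradient step on the Stiefel manifold, applied for simplicity to the ordinary PCA objective trace(V^T S V) rather than the paper's fair, distributionally robust one (step size, iteration count, and data are arbitrary choices):

import numpy as np

rng = np.random.default_rng(3)
d, k = 10, 2
X = rng.normal(size=(500, d)) * np.linspace(3.0, 1.0, d)   # toy data with a clear eigengap
X -= X.mean(axis=0)
S = X.T @ X / X.shape[0]                                   # sample covariance

def qr_retraction(M):
    # Map an arbitrary d x k matrix back onto the Stiefel manifold {V : V^T V = I}
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))                         # fix the sign ambiguity of QR

V = qr_retraction(rng.normal(size=(d, k)))
eta = 0.1
for _ in range(300):
    G = 2 * S @ V                        # Euclidean gradient of trace(V^T S V)
    V = qr_retraction(V + eta * G)       # ascent step followed by retraction

# The span of V converges to the span of the top-k eigenvectors of S
w, U = np.linalg.eigh(S)
print(np.linalg.norm(V @ V.T - U[:, -k:] @ U[:, -k:].T))   # close to 0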
How Principal Component Analysis [PCA] Works
What is PCA: Principal Component Analysis (PCA) is a dimensionality reduction technique that enables you to identify correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension without losing any important information.
Step 1: Standardization of the data. Standardization is all about scaling your data so that all the variables and their values lie within a similar range.
Step 2: Computing the covariance matrix. A covariance matrix expresses the correlation between the different variables in the data set. It is essential to identify heavily dependent variables because they contain biased and redundant information, which reduces the overall performance of the model.
Step 3: Computing the eigenvectors and eigenvalues. Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance matrix in order to determine the principal components of the data set.
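The three steps above map directly onto a few lines of NumPy; the sketch below (toy data and variable names are illustrative) standardizes the data, forms the covariance matrix, and extracts its eigenvalues and eigenvectors to build the principal components:

import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # toy data: 100 samples, 5 variables

# Step 1: standardization -- zero mean and unit variance for every variable
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized variables
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]         # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top-2 principal components
X_pca = X_std @ eigvecs[:, :2]
print(eigvals / eigvals.sum())            # explained variance ratio per component
print(X_pca.shape)                        # (100, 2)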
Robust factored principal component analysis for matrix-valued outlier accommodation and detection
Ma, Xuan, Zhao, Jianhua, Wang, Yue
Principal component analysis (PCA) is a popular dimension reduction technique for vector data. Factored PCA (FPCA) is a probabilistic extension of PCA for matrix data, which can substantially reduce the number of parameters in PCA while yielding satisfactory performance. However, FPCA is based on the Gaussian assumption and is thereby susceptible to outliers. Although the multivariate $t$ distribution as a robust modeling tool for vector data has a very long history, its application to matrix data is very limited. The main reason is that the dimension of the vectorized matrix data is often very high, and the higher the dimension, the lower the breakdown point that measures the robustness. To solve the robustness problem suffered by FPCA and make it applicable to matrix data, in this paper we propose a robust extension of FPCA (RFPCA), which is built upon a $t$-type distribution called the matrix-variate $t$ distribution. Like the multivariate $t$ distribution, the matrix-variate $t$ distribution can adaptively down-weight outliers and yield robust estimates. We develop a fast EM-type algorithm for parameter estimation. Experiments on synthetic and real-world datasets reveal that RFPCA compares favorably with several related methods and that RFPCA is a simple but powerful tool for matrix-valued outlier detection.
Learnable Faster Kernel-PCA for Nonlinear Fault Detection: Deep Autoencoder-Based Realization
Ren, Zelin, Yang, Xuebing, Jiang, Yuchen, Zhang, Wensheng
Kernel principal component analysis (KPCA) is a well-recognized nonlinear dimensionality reduction method that has been widely used in nonlinear fault detection tasks. As a kernel trick-based method, KPCA inherits two major problems. First, the form and the parameters of the kernel function are usually selected blindly, depending heavily on trial-and-error. As a result, there may be serious performance degradation in the case of inappropriate selections. Second, at the online monitoring stage, KPCA has a heavy computational burden and poor real-time performance, because the kernel method requires leveraging all the offline training data. In this work, to deal with these two drawbacks, a learnable faster realization of the conventional KPCA is proposed. The core idea is to parameterize all feasible kernel functions using the novel nonlinear DAE-FE (deep autoencoder based feature extraction) framework and to propose the DAE-PCA (deep autoencoder based principal component analysis) approach in detail. The proposed DAE-PCA method is proved to be equivalent to KPCA but has more advantages in terms of automatically searching for the most suitable nonlinear high-dimensional space according to the inputs. Furthermore, the online computational efficiency improves by approximately 100 times compared with the conventional KPCA. With the Tennessee Eastman (TE) process benchmark, the effectiveness and superiority of the proposed method are illustrated.
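The DAE-PCA method proposed in the work is not reproduced here. For context only, the sketch below shows the conventional KPCA baseline it improves upon, using scikit-learn's KernelPCA with a hand-picked RBF kernel (exactly the kind of trial-and-error choice the paper criticizes) and a simple reconstruction-error score as a fault indicator; the data, threshold rule, and parameter values are assumptions for illustration:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)

# "Normal operation" training data and a test batch with a simulated fault
X_train = rng.normal(size=(300, 8))
X_test = rng.normal(size=(50, 8))
X_test[25:] += 4.0                        # crude fault: mean shift in the second half

scaler = StandardScaler().fit(X_train)
kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.1,
                 fit_inverse_transform=True).fit(scaler.transform(X_train))

def reconstruction_error(X):
    Z = kpca.transform(scaler.transform(X))
    X_hat = kpca.inverse_transform(Z)     # back to the (scaled) input space
    return np.sum((scaler.transform(X) - X_hat) ** 2, axis=1)

# Flag samples whose error exceeds the 99th percentile of the training error
threshold = np.percentile(reconstruction_error(X_train), 99)
print(np.mean(reconstruction_error(X_test) > threshold))   # fraction of test samples flagged

Note that every call to reconstruction_error implicitly evaluates kernels against all training samples, which is precisely the online cost the paper's deep autoencoder-based reformulation is designed to avoid.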