Dimensionality Reduction
CCP: Correlated Clustering and Projection for Dimensionality Reduction
Hozumi, Yuta, Wang, Rui, Wei, Guo-Wei
Most dimensionality reduction methods employ frequency domain representations obtained from matrix diagonalization and may not be efficient for large datasets with relatively high intrinsic dimensions. To address this challenge, Correlated Clustering and Projection (CCP) offers a novel data domain strategy that does not need to solve any matrix. CCP partitions high-dimensional features into correlated clusters and then projects correlated features in each cluster into a one-dimensional representation based on sample correlations. Residue-Similarity (R-S) scores and indexes, the shape of data in Riemannian manifolds, and algebraic topology-based persistent Laplacian are introduced for visualization and analysis. Proposed methods are validated with benchmark datasets associated with various machine learning algorithms.
Distribution Agnostic Symbolic Representations for Time Series Dimensionality Reduction and Online Anomaly Detection
Bountrogiannis, Konstantinos, Tzagkarakis, George, Tsakalides, Panagiotis
Due to the importance of the lower bounding distances and the attractiveness of symbolic representations, the family of symbolic aggregate approximations (SAX) has been used extensively for encoding time series data. However, typical SAX-based methods rely on two restrictive assumptions: the Gaussian distribution and equiprobable symbols. This paper proposes two novel data-driven SAX-based symbolic representations, distinguished by their discretization steps. The first representation, oriented toward general data compaction and indexing scenarios, combines kernel density estimation with Lloyd-Max quantization to minimize the information loss and mean squared error in the discretization step. The second method, oriented toward high-level mining tasks, employs the Mean-Shift clustering method and is shown to enhance anomaly detection in the lower-dimensional space. In addition, we verify on a theoretical basis a previously observed phenomenon of the intrinsic process that results in a lower-than-expected variance of the intermediate piecewise aggregate approximation. This phenomenon causes an additional information loss but can be avoided with a simple modification. The proposed representations possess all the attractive properties of the conventional SAX method. Furthermore, experimental evaluation on real-world datasets demonstrates their superiority compared to the traditional SAX and an alternative data-driven SAX variant.
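For orientation, here is a minimal sketch of the conventional SAX pipeline that the abstract takes as its baseline: z-normalization, piecewise aggregate approximation (PAA), and discretization with equiprobable Gaussian breakpoints. It is not the data-driven method proposed in the paper. The function names, the four-letter alphabet, and the toy series are assumptions made for illustration. The last line also computes the variance of the PAA values, which comes out below 1 for this toy input, in line with the lower-than-expected variance the abstract mentions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each equal-length segment."""
    if len(series) % n_segments != 0:
        raise ValueError("this sketch needs len(series) divisible by n_segments")
    width = len(series) // n_segments
    return [mean(series[i:i + width]) for i in range(0, len(series), width)]


def gaussian_breakpoints(alphabet_size):
    """Cut points that split N(0, 1) into equiprobable regions."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax(series, n_segments, alphabet="abcd"):
    """Conventional SAX word: one symbol per PAA segment."""
    segments = paa(znormalize(series), n_segments)
    cuts = gaussian_breakpoints(len(alphabet))
    # bisect_right returns the index of the region each PAA value falls into.
    return "".join(alphabet[bisect_right(cuts, value)] for value in segments)


series = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical toy input
word = sax(series, n_segments=4)

# The z-normalized series has variance 1, but averaging within segments
# shrinks the variance of the PAA values below 1.
paa_variance = pstdev(paa(znormalize(series), 4)) ** 2
```

The Gaussian breakpoints are where the two assumptions criticized in the abstract enter: the cut points are correct only if the PAA values are standard normal, and they are chosen so that every symbol is equally likely.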
Dimensionality Reduction: Principal Component Analysis
A dataset is made up of a number of features. As long as these features are related in some way to the target and are not too numerous, a machine learning model can produce decent results after learning from the data. But if the number of features is high and most of them do not contribute to the model's learning, the performance of the model goes down and the time taken to output predictions increases. Transforming the original feature space into a lower-dimensional subspace is one way of performing dimensionality reduction, and Principal Component Analysis (PCA) does exactly this. So let's take a look at the concepts behind PCA.
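To make the idea of projecting onto a subspace concrete, the following sketch computes the first principal component of a tiny two-feature dataset and projects the data onto it. It is an illustration under stated assumptions, not a production implementation: the leading eigenvector of the covariance matrix is found by power iteration to keep the example dependency-free, the helper names and the toy data are hypothetical, and in practice one would normally rely on a linear algebra library instead.

```python
import math


def center(rows):
    """Subtract each feature's mean so the data cloud is centred at the origin."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration, returned with unit length.

    Assumes cov is non-zero and the start vector is not orthogonal to the
    leading eigenvector, which holds for the toy data below.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, direction):
    """One-dimensional coordinates of each row along the given direction."""
    return [sum(x * u for x, u in zip(row, direction)) for row in centered]


# Two strongly related features: most of the variation lies along one direction.
data = [[2.0, 1.9], [0.0, 0.1], [1.0, 1.1], [3.0, 2.9], [4.0, 4.0]]
centered = center(data)
cov = covariance(centered)
pc1 = first_component(cov)
scores = project(centered, pc1)  # the reduced, one-dimensional representation
```

Here `scores` replaces the two original columns with a single column. Because the two features are almost perfectly correlated, the variance along `pc1` is at least as large as the variance of either original feature, so little information is lost.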
A New Dimensionality Reduction Method Based on Hensel's Compression for Privacy Protection in Federated Learning
Ouadrhiri, Ahmed El, Abdelhadi, Ahmed
Differential privacy (DP) is considered a de-facto standard for protecting users' privacy in data analysis, machine learning, and deep learning. Existing DP-based privacy-preserving training approaches consist of adding noise to the clients' gradients before sharing them with the server. However, implementing DP on the gradient is not efficient, as the privacy leakage increases with the number of synchronization training epochs due to the composition theorem. Recently, researchers were able to recover images used in the training dataset using a Generative Regression Neural Network (GRNN) even when the gradient was protected by DP. In this paper, we propose a two-layer privacy protection approach to overcome the limitations of the existing DP-based approaches. The first layer reduces the dimension of the training dataset based on Hensel's Lemma. We are the first to use Hensel's Lemma for reducing the dimension of (i.e., compressing) a dataset. The new dimensionality reduction method allows reducing the dimension of a dataset without losing information, since Hensel's Lemma guarantees uniqueness. The second layer applies DP to the compressed dataset generated by the first layer. The proposed approach overcomes the problem of privacy leakage due to composition by applying DP only once before the training; clients train their local model on the privacy-preserving dataset generated by the second layer. Experimental results show that the proposed approach ensures strong privacy protection while achieving good accuracy. The new dimensionality reduction method achieves an accuracy of 97% with only 25% of the original data size.
5 Papers to Read on Dimensionality Reduction Method in 2022
Abstract: Dimension reduction is an important tool for analyzing high-dimensional data. The predictor envelope is a method of dimension reduction for regression that assumes certain linear combinations of the predictors are immaterial to the regression. The method can result in substantial gains in estimation efficiency and prediction accuracy over traditional maximum likelihood and least squares estimates. While predictor envelopes have been developed and studied for independent data, no work has been done adapting predictor envelopes to spatial data. In this work, the predictor envelope is adapted to a popular spatial model to form the spatial predictor envelope (SPE).
Incorporating Texture Information into Dimensionality Reduction for High-Dimensional Images
Vieth, Alexander, Vilanova, Anna, Lelieveldt, Boudewijn, Eisemann, Elmar, Höllt, Thomas
High-dimensional imaging is becoming increasingly relevant in many fields from astronomy and cultural heritage to systems biology. Visual exploration of such high-dimensional data is commonly facilitated by dimensionality reduction. Consequently, exploration of such data is typically split into a step focusing on the attribute space followed by [...]. In this paper, we present a method for incorporating spatial neighborhood information into distance-based dimensionality reduction methods, such as [...]. We achieve this by modifying the distance measure between high-dimensional attribute vectors associated with each pixel such that it takes the pixel's spatial neighborhood into account. Based on a classification of different methods for comparing image patches, we explore a [...]. We compare these approaches from a theoretical and experimental point of view. [...] evaluation on synthetic data and two real-world use cases.
Figure 1: Texture-aware dimensionality reduction. An image (a) with black and white pixels forms multiple textures. Distance-based dimensionality reduction produces one cluster of black and one cluster of white pixels (b); a texture-aware version should create clusters for the different textures (c).
High-dimensional data is commonly acquired and analyzed in various application domains, from systems biology [26] to insurance fraud detection [37]. Typically, high-dimensional data are tabular data with many columns (or attributes), corresponding to the dimensionality [...] The spatial configuration is, however, commonly of interest when analyzing high-dimensional image data. [...] neighborhood information into account, in addition to high-dimensional [...] Typical approaches to combine high-dimensional [...] They use the embedding as a colormap and perform segmentation on the re-colored image. Decoupling the high-dimensional and spatial analysis in such a way has several downsides: Most importantly, boundaries between clusters in an embedding are often not well defined, and as such classification is ambiguous and has a level of arbitrariness.
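The idea of a distance that takes a pixel's spatial neighborhood into account can be illustrated with a small sketch. It is not the paper's method: the authors explore several ways of comparing image patches, and the one below (mean absolute difference over aligned square patches) is simply an assumed, easy-to-read choice. For brevity each pixel carries a single scalar value rather than a high-dimensional attribute vector, and the function names and the toy image are hypothetical. The image mimics the situation in the figure caption: black and white pixels arranged in two different textures.

```python
def pixel_distance(image, p, q):
    """Plain attribute-space distance: compares only the two pixels' own values."""
    (r1, c1), (r2, c2) = p, q
    return abs(image[r1][c1] - image[r2][c2])


def patch_distance(image, p, q, radius=1):
    """Mean absolute difference over the square patches centred on p and q.

    Assumes both patches lie fully inside the image (no border handling).
    """
    (r1, c1), (r2, c2) = p, q
    total, count = 0.0, 0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            total += abs(image[r1 + dr][c1 + dc] - image[r2 + dr][c2 + dc])
            count += 1
    return total / count


# Toy 5x10 image: checkerboard texture in columns 0-4,
# horizontal stripes in columns 5-9 (0 = black, 1 = white).
image = [[(r + c) % 2 if c < 5 else r % 2 for c in range(10)] for r in range(5)]

checker_a = (2, 2)  # black pixel inside the checkerboard
checker_b = (1, 3)  # another black pixel inside the checkerboard
stripe = (2, 7)     # black pixel inside the stripes

same_value = pixel_distance(image, checker_a, stripe)        # 0: both are black
across_textures = patch_distance(image, checker_a, stripe)   # > 0: textures differ
within_texture = patch_distance(image, checker_a, checker_b) # 0: same texture
```

With the plain pixel distance the two black pixels are indistinguishable, so a distance-based embedding would place them in the same cluster. The patch-based distance separates the checkerboard pixel from the striped one while keeping the two checkerboard pixels together, which is the behaviour the caption asks of a texture-aware version.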
A Dimensionality Reduction Method for Finding Least Favorable Priors with a Focus on Bregman Divergence
Dytso, Alex, Goldenbaum, Mario, Poor, H. Vincent, Shamai, Shlomo
A common way of characterizing minimax estimators in point estimation is by moving the problem into the Bayesian estimation domain and finding a least favorable prior distribution. The Bayesian estimator induced by a least favorable prior, under mild conditions, is then known to be minimax. However, finding least favorable distributions can be challenging due to inherent optimization over the space of probability distributions, which is infinite-dimensional. This paper develops a dimensionality reduction method that allows us to move the optimization to a finite-dimensional setting with an explicit bound on the dimension. The benefit of this dimensionality reduction is that it permits the use of popular algorithms such as projected gradient ascent to find least favorable priors. Throughout the paper, in order to make progress on the problem, we restrict ourselves to Bayesian risks induced by a relatively large class of loss functions, namely Bregman divergences.
Non-Linear Spectral Dimensionality Reduction Under Uncertainty
Laakom, Firas, Raitoharju, Jenni, Passalis, Nikolaos, Iosifidis, Alexandros, Gabbouj, Moncef
In this paper, we consider the problem of non-linear dimensionality reduction under uncertainty, from both theoretical and algorithmic perspectives. Since real-world data usually contain measurements with uncertainties and artifacts, the input space in the proposed framework consists of probability distributions to model the uncertainties associated with each sample. We propose a new dimensionality reduction framework, called NGEU, which leverages uncertainty information and directly extends several traditional approaches, e.g., KPCA and MDA/KMFA, to receive as inputs the probability distributions instead of the original data. We show that the proposed NGEU formulation exhibits a global closed-form solution, and we analyze, based on the Rademacher complexity, how the underlying uncertainties theoretically affect the generalization ability of the framework. Empirical results on different datasets show the effectiveness of the proposed framework.
Dimensionality Reduction Meets Message Passing for Graph Node Embeddings
Sadowski, Krzysztof, Szarmach, Michał, Mattia, Eddie
Graph Neural Networks (GNNs) have become a popular approach for various applications, ranging from social network analysis to modeling chemical properties of molecules. While GNNs often show remarkable performance on public datasets, they can struggle to learn long-range dependencies in the data due to over-smoothing and over-squashing tendencies. To alleviate this challenge, we propose PCAPass, a method which combines Principal Component Analysis (PCA) and message passing for generating node embeddings in an unsupervised manner and leverages gradient boosted decision trees for classification tasks. We show empirically that this approach provides competitive performance compared to popular GNNs on node classification benchmarks, while gathering information from longer distance neighborhoods. Our research demonstrates that applying dimensionality reduction with message passing and skip connections is a promising mechanism for aggregating long-range dependencies in graph structured data.
Why you should be using PHATE for dimensionality reduction
As data scientists, we often work with high-dimensional data with more than 3 features, or dimensions, of interest. In supervised machine learning, we may use this data for training and classification, for example, and may reduce the dimensions to speed up the training. In unsupervised learning, we use this type of data for visualization and clustering. In single-cell RNA sequencing (scRNA-seq), for example, we accumulate measurements of tens of thousands of genes per cell for upwards of a million cells. That's a lot of data that provides a window into the cell's identity, state, and other properties.