Dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection (find a subset of the original variables) and feature extraction (transform the data in the high-dimensional space to a space of fewer dimensions). (Wikipedia)
Abstract: Dimension reduction is an important tool for analyzing high-dimensional data. The predictor envelope is a method of dimension reduction for regression that assumes certain linear combinations of the predictors are immaterial to the regression. The method can result in substantial gains in estimation efficiency and prediction accuracy over traditional maximum likelihood and least squares estimates. While predictor envelopes have been developed and studied for independent data, no work has been done adapting predictor envelopes to spatial data. In this work, the predictor envelope is adapted to a popular spatial model to form the spatial predictor envelope (SPE).
In this paper, we consider the problem of non-linear dimensionality reduction under uncertainty, both from a theoretical and algorithmic perspectives. Since real-world data usually contain measurements with uncertainties and artifacts, the input space in the proposed framework consists of probability distributions to model the uncertainties associated with each sample. We propose a new dimensionality reduction framework, called NGEU, which leverages uncertainty information and directly extends several traditional approaches, e.g., KPCA, MDA/KMFA, to receive as inputs the probability distributions instead of the original data. We show that the proposed NGEU formulation exhibits a global closed-form solution, and we analyze, based on the Rademacher complexity, how the underlying uncertainties theoretically affect the generalization ability of the framework. Empirical results on different datasets show the effectiveness of the proposed framework.
Graph Neural Networks (GNNs) have become a popular approach for various applications, ranging from social network analysis to modeling chemical properties of molecules. While GNNs often show remarkable performance on public datasets, they can struggle to learn long-range dependencies in the data due to over-smoothing and over-squashing tendencies. To alleviate this challenge, we propose PCAPass, a method which combines Principal Component Analysis (PCA) and message passing for generating node embeddings in an unsupervised manner and leverages gradient boosted decision trees for classification tasks. We show empirically that this approach provides competitive performance compared to popular GNNs on node classification benchmarks, while gathering information from longer distance neighborhoods. Our research demonstrates that applying dimensionality reduction with message passing and skip connections is a promising mechanism for aggregating long-range dependencies in graph structured data.
As data scientists, we often work with high-dimensional data with more than 3 features, or dimensions, of interest. In supervised machine learning, we may use this data for training and classification for example and may reduce the dimensions to speed up the training. In unsupervised learning, we use this type of data for visualization and clustering. In single-cell RNA sequencing (scRNA-seq), for example, we accumulate measurements of tens of thousands of genes per cell for upwards of a million cells. That's a lot of data that provides a window into the cell's identity, state, and other properties.
Existing explanation methods for black-box supervised learning models generally work by building local models that explain the models behaviour for a particular data item. It is possible to make global explanations, but the explanations may have low fidelity for complex models. Most of the prior work on explainable models has been focused on classification problems, with less attention on regression. We propose a new manifold visualization method, SLISEMAP, that at the same time finds local explanations for all of the data items and builds a two-dimensional visualization of model space such that the data items explained by the same model are projected nearby. We provide an open source implementation of our methods, implemented by using GPU-optimized PyTorch library. SLISEMAP works both on classification and regression models. We compare SLISEMAP to most popular dimensionality reduction methods and some local explanation methods. We provide mathematical derivation of our problem and show that SLISEMAP provides fast and stable visualizations that can be used to explain and understand black box regression and classification models.
Abstract: Dimensionality reduction methods have found vast application as visualization tools in diverse areas of science. Although many different methods exist, their performance is often insufficient for providing quick insight into many contemporary datasets, and the unsupervised mode of use prevents the users from utilizing the methods for dataset exploration and finetuning the details for improved visualization quality. BlosSOM builds on a GPUaccelerated implementation of the EmbedSOM algorithm, complemented by several landmarkbased algorithms for interfacing the unsupervised model learning algorithms with the user supervision. We show the application of BlosSOM on realistic datasets, where it helps to produce high-quality visualizations that incorporate user-specified layout and focus on certain features. We believe the semi-supervised dimensionality reduction will improve the data visualization possibilities for science areas such as single-cell cytometry, and provide a fast and efficient base methodology for new directions in dataset exploration and annotation. Dimensionality reduction algorithms emerged as indispensable utilities that enable various forms of intuitive data visualization, providing insight that in turn simplifies rigorous data analysis. Various algorithms have been proposed for graphs and high-dimensional point-cloud data, and many different types of datasets that can be represented with a graph structure or embedded into vector spaces. Performance of the non-linear dimensionality reduction algorithms becomes a concern if the analysis pipeline is required to scale or when the results are required in a limited amount of time such as in clinical settings. The most popular methods, typically based on neighborhood embedding computed by stochastic descent, force-based layouting or neural autoencoders, reach applicability limits when the dataset size is too large. To tackle the limitations, we have previously developed EmbedSOM , a dimensionality reduction and visualization algorithm based on self-organizing maps (SOMs) . EmbedSOM provided an order-of-magnitude speedup on datasets typical for the single-cell cytometry data visualization while retaining competitive quality of the results. The concept has proven useful for interactive and high-performance workflows in cytometry [16, 14], and easily applies to many other types of datasets.
Normalizing flows are diffeomorphic, typically dimension-preserving, models trained using the likelihood of the model. We use the SurVAE framework to construct dimension reducing surjective flows via a new layer, known as the funnel. We demonstrate its efficacy on a variety of datasets, and show it improves upon or matches the performance of existing flows while having a reduced latent space size. The funnel layer can be constructed from a wide range of transformations including restricted convolution and feed forward layers.
Hyperbolic neural networks have been popular in the recent past due to their ability to represent hierarchical data sets effectively and efficiently. The challenge in developing these networks lies in the nonlinearity of the embedding space namely, the Hyperbolic space. Hyperbolic space is a homogeneous Riemannian manifold of the Lorentz group. Most existing methods (with some exceptions) use local linearization to define a variety of operations paralleling those used in traditional deep neural networks in Euclidean spaces. In this paper, we present a novel fully hyperbolic neural network which uses the concept of projections (embeddings) followed by an intrinsic aggregation and a nonlinearity all within the hyperbolic space. The novelty here lies in the projection which is designed to project data on to a lower-dimensional embedded hyperbolic space and hence leads to a nested hyperbolic space representation independently useful for dimensionality reduction. The main theoretical contribution is that the proposed embedding is proved to be isometric and equivariant under the Lorentz transformations. This projection is computationally efficient since it can be expressed by simple linear operations, and, due to the aforementioned equivariance property, it allows for weight sharing. The nested hyperbolic space representation is the core component of our network and therefore, we first compare this ensuing nested hyperbolic space representation with other dimensionality reduction methods such as tangent PCA, principal geodesic analysis (PGA) and HoroPCA. Based on this equivariant embedding, we develop a novel fully hyperbolic graph convolutional neural network architecture to learn the parameters of the projection. Finally, we present experiments demonstrating comparative performance of our network on several publicly available data sets.
Data forms the foundation of any machine learning algorithm, without it, Data Science can not happen. Sometimes, it can contain a huge number of features, some of which are not even required. Such redundant information makes modeling complicated. Furthermore, interpreting and understanding the data by visualization gets difficult because of the high dimensionality. This is where dimensionality reduction comes into play. Dimensionality reduction is the task of reducing the number of features in a dataset. In machine learning tasks like regression or classification, there are often too many variables to work with. These variables are also called features.
A successful data scientist understands a wide range of Machine Learning algorithms and can explain the results to stakeholders. But, unfortunately, not every stakeholder has a sufficient amount of training to grasp the complexities of ML. Luckily, we can aid our explanations by using dimensionality reduction techniques to create visual representations of high dimensional data. This article will take you through one such technique called t-Distributed Stochastic Neighbor Embedding (t-SNE). Perfect categorization of Machine Learning techniques is not always possible due to the flexibility demonstrated by specific algorithms, making them useful when solving different problems (e.g., one can use k-NN for regression and classification).