 Dimensionality Reduction


Dimensionality reduction methods for molecular simulations

arXiv.org Machine Learning

Molecular simulations produce very high-dimensional data-sets with millions of data points. As analysis methods are often unable to cope with so many dimensions, it is common to use dimensionality reduction and clustering methods to reach a reduced representation of the data. Yet these methods often fail to capture the most important features necessary for the construction of a Markov model. Here we demonstrate the results of various dimensionality reduction methods on two simulation data-sets, one of protein folding and another of protein-ligand binding. The methods tested include a k-means clustering variant, a non-linear autoencoder, principal component analysis and tICA. To assess the quality of each projection, the dimension-reduced data is then used in a Markov state model analysis to estimate the implied timescales of the slowest process. The projected dimensions learned from the data are visualized to demonstrate which conformations the various methods choose to represent the molecular process.
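As a rough illustration of the "implied timescales" mentioned above, the sketch below applies the standard relation t_i = -lag / ln|lambda_i| to the eigenvalues of a row-stochastic transition matrix. The function name `implied_timescales`, the two-state matrix and the lag of 10 steps are all hypothetical and are not taken from the paper.

```python
import numpy as np

def implied_timescales(T, lag):
    """Implied timescales t_i = -lag / ln|lambda_i| of a row-stochastic matrix T.

    The eigenvalue of magnitude 1 (the stationary process) is skipped.
    """
    eigvals = np.linalg.eigvals(T)
    # Sort eigenvalue magnitudes in descending order; the first one is 1.
    mags = np.sort(np.abs(eigvals))[::-1]
    return -lag / np.log(mags[1:])

# Hypothetical two-state transition matrix, assumed to be estimated at lag = 10 steps.
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
timescales = implied_timescales(T, lag=10)
print(timescales)  # one timescale, for the single non-stationary process
```

A projection that preserves the slow dynamics should yield a longer slowest timescale than one that mixes slow and fast motions, which is why this quantity can serve as a quality score.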


DPCA: Dimensionality Reduction for Discriminative Analytics of Multiple Large-Scale Datasets

arXiv.org Machine Learning

Principal component analysis (PCA) has well-documented merits for data extraction and dimensionality reduction. PCA deals with a single dataset at a time, and it is challenged when it comes to analyzing multiple datasets. Yet in certain setups, one wishes to extract the most significant information of one dataset relative to other datasets. Specifically, the interest may be in identifying, namely extracting, features that are specific to a single target dataset but not the others. This paper develops a novel approach for such so-termed discriminative data analysis, and establishes its optimality in the least-squares (LS) sense under suitable data modeling assumptions. The criterion reveals linear combinations of variables by maximizing the ratio of the variance of the target data to that of the remainders. The novel approach solves a generalized eigenvalue problem by performing SVD just once. Numerical tests using synthetic and real datasets showcase the merits of the proposed approach relative to its competing alternatives.
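The variance-ratio criterion described above can be sketched as a generalized eigenvalue problem Cx u = lambda Cy u, where Cx and Cy are the covariances of the target and the remaining data. The sketch below solves it by Cholesky whitening in NumPy; this is an assumption made for illustration and is not the single-SVD solver the paper proposes. The function name, the `eps` regularizer and the synthetic data are hypothetical.

```python
import numpy as np

def discriminative_directions(target, background, k=1, eps=1e-8):
    """Directions u maximizing (u^T Cx u) / (u^T Cy u).

    Cx, Cy are the sample covariances of target and background data.
    Solved via Cholesky whitening (an illustrative choice, not the paper's solver).
    """
    Xc = target - target.mean(axis=0)
    Yc = background - background.mean(axis=0)
    Cx = Xc.T @ Xc / len(Xc)
    # Small ridge (hypothetical eps) keeps Cy positive definite.
    Cy = Yc.T @ Yc / len(Yc) + eps * np.eye(Yc.shape[1])
    L = np.linalg.cholesky(Cy)            # Cy = L L^T
    A = np.linalg.solve(L, Cx)            # L^{-1} Cx
    M = np.linalg.solve(L, A.T)           # L^{-1} Cx L^{-T}
    M = (M + M.T) / 2                     # remove round-off asymmetry
    w, V = np.linalg.eigh(M)              # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]
    U = np.linalg.solve(L.T, V[:, order]) # map back: u = L^{-T} v
    return U, w[order]

# Synthetic data: axis 0 has large variance in both sets,
# axis 2 has extra variance only in the target set.
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 1.0])
target = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 2.0])
U, ratios = discriminative_directions(target, background, k=1)
```

On this synthetic data, ordinary PCA of the target set would favor axis 0, where the variance is largest, while the ratio criterion favors axis 2, where the target differs from the background.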


Dimensionality Reduction Ensembles

arXiv.org Machine Learning

Ensemble learning has had many successes in supervised learning, but it has been rare in unsupervised learning and dimensionality reduction. This study explores dimensionality reduction ensembles, using principal component analysis and manifold learning techniques to capture linear, nonlinear, local, and global features in the original dataset. Dimensionality reduction ensembles are tested first on simulation data and then on two real medical datasets using random forest classifiers; results suggest the efficacy of this approach, with accuracies approaching that of the full dataset. Limitations include computational cost of some algorithms with strong performance, which may be ameliorated through distributed computing and the development of more efficient versions of these algorithms.


Simultaneously Learning Neighborship and Projection Matrix for Supervised Dimensionality Reduction

arXiv.org Machine Learning

Explicitly or implicitly, most dimensionality reduction methods need to determine which samples are neighbors and the similarity between the neighbors in the original high-dimensional space. The projection matrix is then learned on the assumption that the neighborhood information (e.g., the similarity) is known and fixed prior to learning. However, it is difficult to precisely measure the intrinsic similarity of samples in high-dimensional space because of the curse of dimensionality. Consequently, the neighbors selected according to such similarity, and the projection matrix obtained from that similarity and those neighbors, might not be optimal in the sense of classification and generalization. To overcome these drawbacks, in this paper we propose to let the similarity and neighbors be variables and model them in low-dimensional space. Both the optimal similarity and projection matrix are obtained by minimizing a unified objective function. Nonnegative and sum-to-one constraints on the similarity are adopted. Instead of empirically setting the regularization parameter, we treat it as a variable to be optimized. Interestingly, the optimal regularization parameter adapts to the neighbors in low-dimensional space and has an intuitive meaning. Experimental results on the YALE B, COIL-100, and MNIST datasets demonstrate the effectiveness of the proposed method.


Checking out dimensionality reduction with t-SNE – Hannah Yan Han – Medium

#artificialintelligence

You can read all about fashionMNIST here, which is set out to be MNIST scaled up in complexity. While MNIST contains handwritten digits from 0 to 9, fashionMNIST contains 10 different kinds of attire, from t-shirts to dresses to trousers. I ran t-SNE on the entire original MNIST training set, which is rather well separated, and compared it with fashionMNIST, where I observed some overlap. We can further rotate the plot in plotly and remove the clearly separated classes to identify the overlapping ones: t-shirt, shirt and coat.


Evaluating Graph Signal Processing for Neuroimaging Through Classification and Dimensionality Reduction

arXiv.org Machine Learning

Graph Signal Processing (GSP) is a promising framework to analyze multi-dimensional neuroimaging datasets, while taking into account both the spatial and functional dependencies between brain signals. In the present work, we apply dimensionality reduction techniques based on graph representations of the brain to decode brain activity from real and simulated fMRI datasets. We introduce seven graphs obtained from a) geometric structure and/or b) functional connectivity between brain areas at rest, and compare them when performing dimension reduction for classification. We show that mixed graphs using both a) and b) offer the best performance. We also show that graph sampling methods perform better than classical dimension reduction including Principal Component Analysis (PCA) and Independent Component Analysis (ICA).


Clustering and Dimensionality Reduction: Understanding the "Magic" Behind Machine Learning – Blog Imperva

#artificialintelligence

These days we hear about machine learning and artificial intelligence (AI) in all aspects of life. We see machines that learn and imitate the human brain in order to automate human processes. There are autonomous cars that learn the road conditions to drive, personal assistants we can converse with and machines that can predict what stock markets will do. In some respects, it can appear as "magic." Behind machine learning there are some fundamental, well-studied and understood techniques.


Out-of-Sample Extension for Dimensionality Reduction of Noisy Time Series

arXiv.org Machine Learning

This paper proposes an out-of-sample extension framework for a global manifold learning algorithm (Isomap) that uses temporal information in out-of-sample points in order to make the embedding more robust to noise and artifacts. Given a set of noise-free training data and its embedding, the proposed framework extends the embedding for a noisy time series. This is achieved by adding a spatio-temporal compactness term to the optimization objective of the embedding. To the best of our knowledge, this is the first method for out-of-sample extension of manifold embeddings that leverages timing information available for the extension set. Experimental results demonstrate that our out-of-sample extension algorithm renders a more robust and accurate embedding of sequentially ordered image data in the presence of various noise and artifacts when compared to other timing-aware embeddings. Additionally, we show that an out-of-sample extension framework based on the proposed algorithm outperforms the state of the art in eye-gaze estimation.


A Nonlinear Dimensionality Reduction Framework Using Smooth Geodesics

arXiv.org Machine Learning

Existing dimensionality reduction methods are adept at revealing hidden underlying manifolds arising from high-dimensional data and thereby producing a low-dimensional representation. However, the smoothness of the manifolds produced by classic techniques in the presence of noise is not guaranteed. In fact, the embedding generated using such non-smooth, noisy measurements may distort the geometry of the manifold and thereby produce an unfaithful embedding. Herein, we propose a framework for nonlinear dimensionality reduction that generates a manifold in terms of smooth geodesics that is designed to treat problems in which manifold measurements have been corrupted by noise. Our method generates a network structure for given high-dimensional data using a neighborhood search and then produces piecewise linear shortest paths that are defined as geodesics. Then, we fit points in each geodesic by a smoothing spline to emphasize the smoothness. The robustness of this approach for noisy and sparse datasets is demonstrated by the implementation of the method on synthetic and real-world datasets.


Reducing Dimensionality from Dimensionality Reduction Techniques

@machinelearnbot

PCA (Principal Component Analysis) is probably the oldest trick in the book. PCA is well studied and there are numerous ways to get to the same solution. We will talk about two of them here, eigendecomposition and Singular Value Decomposition (SVD), and then we will implement the SVD way in TensorFlow. From now on, X will be our data matrix, of shape (n, p), where n is the number of examples and p is the number of dimensions. Given X, both methods try, each in its own way, to decompose X so that the decomposed results can later be multiplied together to represent the maximum information in fewer dimensions. I know, I know, it sounds horrible, but I will spare you most of the math and keep the parts that contribute to understanding the method's pros and cons.
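To make the SVD route concrete, here is a minimal sketch of PCA on a data matrix X of shape (n, p). It uses NumPy rather than the TensorFlow implementation the article goes on to build, and the function name `pca_svd` and the random data are assumptions for illustration only.

```python
import numpy as np

def pca_svd(X, k):
    """Project X (n examples, p dimensions) onto its top-k principal components."""
    # Center each column so the SVD describes variance around the mean.
    Xc = X - X.mean(axis=0)
    # Thin SVD: Xc = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # principal directions, shape (k, p)
    scores = Xc @ components.T                # same as U[:, :k] * S[:k]
    explained_variance = S[:k] ** 2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Hypothetical random data: 200 examples, 5 dimensions, reduced to 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
scores, components, var = pca_svd(X, k=2)
```

The squared singular values divided by n - 1 equal the largest eigenvalues of the sample covariance matrix, which is why the eigendecomposition and SVD routes arrive at the same solution.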