 Learning in High Dimensional Spaces


Building Models for Biopathway Dynamics Using Intrinsic Dimensionality Analysis

arXiv.org Machine Learning

An important task in many, if not all, scientific domains is efficient knowledge integration, testing and codification. It is often addressed by constructing models in a controllable computational environment. Even so, the throughput of in-silico simulation-based observations becomes similarly intractable for thorough analysis. This is especially the case in molecular biology, which served as the subject of this study. In this project, we aimed to test some approaches developed to deal with the curse of dimensionality. Among these, we found dimension reduction techniques especially appealing. They can be used to identify irrelevant variability and help to understand critical processes underlying high-dimensional datasets. Additionally, we subjected our data sets to nonlinear time series analysis, as those are well-established methods for comparing results. To investigate the usefulness of dimension reduction methods, we decided to base our study on a concrete sample set. The example was taken from the domain of systems biology and concerns the dynamic evolution of sub-cellular signaling. In particular, the dataset relates to the yeast pheromone pathway and is studied in-silico with a stochastic model. The model reconstructs signal propagation stimulated by a mating pheromone. In the paper, we elaborate on the reasons for the multidimensional analysis problem in the context of molecular signaling, and then introduce the model of choice, the simulation details and the resulting time series dynamics. A description of the methods used, followed by a discussion of the results and their biological interpretation, concludes the paper.


Continuum directions for supervised dimension reduction

arXiv.org Machine Learning

Dimension reduction of multivariate data supervised by auxiliary information is considered. A series of bases for dimension reduction is obtained as minimizers of a novel criterion. The proposed method is akin to continuum regression, and the resulting basis is called continuum directions. In the presence of binary supervision data, these directions continuously bridge the principal component, mean difference and linear discriminant directions, thus ranging from unsupervised to fully supervised dimension reduction. High-dimensional asymptotic studies of continuum directions for binary supervision reveal several interesting facts. The conditions under which the sample continuum directions are inconsistent, but their classification performance is good, are specified. While the proposed method can be directly used for binary and multi-category classification, its generalizations to incorporate any form of auxiliary data are also presented. The proposed method enjoys fast computation, and its performance is better than or on par with more computer-intensive alternatives.

Keywords: continuum regression, dimension reduction, linear discriminant analysis, high-dimension, low-sample-size (HDLSS), maximum data piling, principal component analysis

2000 MSC: 60K35

1. Introduction

In modern complex data, it is increasingly common that multiple data sets are available. Two types of data are collected on the same set of subjects: a data set of primary interest X and an auxiliary data set Y. The goal of supervised dimension reduction is to delineate the major signals in X that depend on Y. Relevant application areas include genomics (genetic studies collect both gene expression and SNP data; Li et al., 2016), finance data (stocks as X in relation to characteristics Y of each stock: size, value, momentum and volatility; Connor et al., 2012), and batch effect adjustments (Lee et al., 2014). There have been a number of works dealing with the multi-source data situation. Lock et al. (2013) developed JIVE to separate joint variation from individual variations. Large-scale correlation studies can identify millions of pairwise associations between two data sets via multiple canonical correlation analysis (Witten and Tibshirani, 2009). These methods, however, do not provide supervised dimension reduction of a particular data set X, since all data sets assume an equal role. In contrast, reduced-rank regression (RRR; Izenman, 1975; Tso, 1981) and envelope models (Cook et al., 2010) provide sufficient dimension reduction (Cook and Ni, 2005) for regression problems. See Cook et al. (2013) for connections between envelopes and partial least squares regression.


Scalable Algorithms for Learning High-Dimensional Linear Mixed Models

arXiv.org Machine Learning

Linear mixed models (LMMs) are used extensively to model dependencies among observations in linear regression across many application areas. Parameter estimation for LMMs can be computationally prohibitive on big data. State-of-the-art learning algorithms require computational complexity that depends at least linearly on the dimension p of the covariates, and often use heuristics that do not offer theoretical guarantees. We present scalable algorithms for learning high-dimensional LMMs with sublinear computational complexity dependence on p. Key to our approach are novel dual estimators, which use only kernel functions of the data, and fast computational techniques based on the subsampled randomized Hadamard transform. We provide theoretical guarantees for our learning algorithms, demonstrating the robustness of parameter estimation. Finally, we complement the theory with experiments on large synthetic and real data.
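As a rough illustration of the subsampled randomized Hadamard transform (SRHT) named in the abstract above, the following sketch compresses one vector from n to k coordinates. It is not the paper's estimator: the dual estimators and kernel computations are not reproduced, and the function names `fwht` and `srht` are hypothetical. It assumes the input length is a power of two.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def srht(vec, k, rng):
    """Sketch vec down to k coordinates: random signs, Hadamard mix, row subsample."""
    n = len(vec)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    mixed = fwht([s * v for s, v in zip(signs, vec)])
    rows = rng.sample(range(n), k)  # k distinct rows, chosen uniformly
    scale = math.sqrt(n / k)        # keeps the squared norm unbiased
    return [scale * mixed[r] for r in rows]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(srht(x, 4, random.Random(0)))
```

The Hadamard step costs O(n log n) per vector instead of the O(nk) of a dense random projection, which is the usual reason for choosing this transform.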


Extreme Dimension Reduction for Handling Covariate Shift

arXiv.org Machine Learning

In the covariate shift learning scenario, the training and test covariate distributions differ, so that a predictor's average losses over the training and test distributions also differ. In this work, we explore the potential of extreme dimension reduction, i.e. to very low dimensions, in improving the performance of importance weighting methods for handling covariate shift, which fail in high dimensions due to potentially high train/test covariate divergence and the inability to accurately estimate the requisite density ratios. We first formulate and solve a problem that optimizes, over linear subspaces, a combination of their predictive utility and the train/test divergence within them. Applying it to simulated and real data, we show that extreme dimension reduction helps sometimes but not always, due to a bias introduced by dimension reduction.
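To make the importance weighting idea in the abstract above concrete, here is a minimal sketch on a toy discrete covariate. The distributions, the loss values, and the function names are all assumptions made for illustration; the density ratio is taken as known, whereas in practice it has to be estimated, which is the step the abstract says breaks down in high dimensions.

```python
def importance_weights(p_train, p_test):
    """Density ratio w(x) = p_test(x) / p_train(x) for each covariate value."""
    return {x: p_test[x] / p_train[x] for x in p_train}


def weighted_mean_loss(sample, loss, weights):
    """Importance-weighted average loss over a training sample."""
    return sum(weights[x] * loss(x) for x in sample) / len(sample)


# Toy setup (assumed): one covariate with three values and a fixed
# per-value loss of some already-trained predictor.
p_train = {0: 0.6, 1: 0.3, 2: 0.1}
p_test = {0: 0.1, 1: 0.3, 2: 0.6}
loss_table = {0: 0.0, 1: 1.0, 2: 4.0}

# A training sample whose empirical frequencies match p_train exactly.
train_sample = [0] * 6 + [1] * 3 + [2] * 1

weights = importance_weights(p_train, p_test)
plain = sum(loss_table[x] for x in train_sample) / len(train_sample)
reweighted = weighted_mean_loss(train_sample, loss_table.get, weights)
test_loss = sum(p_test[x] * loss_table[x] for x in p_test)

if __name__ == "__main__":
    print(plain, reweighted, test_loss)
```

In this toy case the reweighted training loss matches the expected test loss, while the unweighted one does not. Note that a single training point carries a weight of about 6; as the train/test divergence grows, a few such points dominate the estimate and its variance increases.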


[R] UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction • r/MachineLearning

@machinelearnbot

Can I ask you a dumb question? I was thinking about dimensionality reduction the other day and an idea occurred to me: why not just use an autoencoder NN squeezing input data into d dimensions (d = 2, 3, ...) and an appropriate loss function to mimic either PCA or t-SNE, or maybe even UMAP would work? This produces a scalable, incremental (approximate) algorithm that easily supports parallelisation. Besides being slower than a pure C/C++ implementation, do you see something wrong with it?


UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

arXiv.org Machine Learning

Dimension reduction seeks to produce a low dimensional representation of high dimensional data that preserves relevant structure (relevance often being application dependent). Dimension reduction is an important problem in data science, both for visualization and as a potential pre-processing step for machine learning. As a fundamental technique for both visualization and preprocessing, dimension reduction is being applied in a broadening range of fields and on ever increasing sizes of datasets. It is thus desirable to have an algorithm that is both scalable to massive data and able to cope with the diversity of data available. Dimension reduction algorithms tend to fall into two categories: those that seek to preserve the distance structure within the data, and those that favor the preservation of local distances over global distance.


Beginners Guide To Learn Dimension Reduction Techniques

@machinelearnbot

This powerful quote by William Shakespeare applies well to techniques used in data science & analytics as well. Allow me to prove it using a short story. In May 2015, we conducted a Data Hackathon (a data science competition) in Delhi-NCR, India. We gave participants the challenge of identifying human activity using the Human Activity Recognition Using Smartphones Data Set. The data set had 561 variables for training a model to identify human activity in the test data set.


Dimension Reduction Using Active Manifolds

arXiv.org Machine Learning

Scientists and engineers rely on accurate mathematical models to quantify the objects of their studies, which are often high-dimensional. Unfortunately, high-dimensional models are inherently difficult, particularly when observations are sparse or expensive to determine. One way to address this problem is to approximate the original model with fewer input dimensions. Our project goal was to recover a function f that takes n inputs and returns one output, where n is potentially large. For any given n-tuple, we assume that we can observe a sample of the gradient and output of the function but it is computationally expensive to do so. This project was inspired by an approach known as Active Subspaces, which works by linearly projecting to a linear subspace where the function changes most on average. Our research gives mathematical developments informing a novel algorithm for this problem. Our approach, Active Manifolds, increases accuracy by seeking nonlinear analogues that approximate the function. The benefits of our approach are the elimination of unprincipled parameter choices, guaranteed accessible visualization, and improved estimation accuracy.
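The linear Active Subspaces baseline mentioned in the abstract above can be sketched as follows: average the outer products of sampled gradients and take the dominant eigenvector as the direction along which the function changes most on average. This is only the linear baseline, not the Active Manifolds algorithm. The test function f(x) = (a·x)^2, the sample points, and the name `dominant_direction` are assumptions for illustration, and power iteration stands in for a full eigendecomposition.

```python
import math


def dominant_direction(gradients, iterations=100):
    """Top eigenvector of C = mean(g g^T), found by power iteration."""
    n = len(gradients[0])
    m = len(gradients)
    # Average outer product of the sampled gradients.
    c = [[sum(g[i] * g[j] for g in gradients) / m for j in range(n)]
         for i in range(n)]
    v = [1.0] * n  # starting vector; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(c[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("starting vector lies in the null space of C")
        v = [x / norm for x in w]
    return v


# Assumed test function f(x) = (a . x)^2, whose gradient is 2 (a . x) a,
# so f varies only along the direction of a.
a = (3.0, 4.0, 0.0)
points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (-1.0, 2.0, 0.5)]
grads = []
for p in points:
    s = sum(ai * pi for ai, pi in zip(a, p))
    grads.append([2.0 * s * ai for ai in a])

direction = dominant_direction(grads)

if __name__ == "__main__":
    print(direction)  # proportional to a, up to rounding
```

Because every gradient here is a multiple of a, the recovered direction is a normalized to unit length, and the three-input function reduces to a function of the single variable a·x.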


Approximation of Functions over Manifolds: A Moving Least-Squares Approach

arXiv.org Machine Learning

We present an algorithm for approximating a function defined over a $d$-dimensional manifold utilizing only noisy function values at locations sampled from the manifold with noise. To produce the approximation we do not require any knowledge regarding the manifold other than its dimension $d$. The approximation scheme is based upon the Manifold Moving Least-Squares (MMLS). The proposed algorithm is resistant to noise in both the domain and function values. Furthermore, the approximant is shown to be smooth and to have approximation order $\mathcal{O}(h^{m+1})$ for non-noisy data, where $h$ is the mesh size with respect to the manifold domain, and $m$ is the degree of a local polynomial approximation utilized in our algorithm. In addition, the proposed algorithm is linear in time with respect to the dimension of the ambient space. Thus, in the case of an extremely large ambient space dimension, we are able to avoid the curse of dimensionality without having to perform non-linear dimension reduction, which introduces distortions to the manifold data. Using numerical experiments, we compare the presented method to state-of-the-art algorithms for regression over manifolds and show its potential.
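For readers unfamiliar with moving least squares, the sketch below shows the plain version on a flat one-dimensional domain: at each query point, a local polynomial of degree m = 1 is fitted with weights that decay with distance, and the fitted value at the query is returned. This is a deliberate simplification and not the MMLS scheme of the abstract, which additionally estimates a local coordinate system on the manifold. The Gaussian weight, the bandwidth value, and the name `mls_value` are assumptions.

```python
import math


def mls_value(x0, xs, ys, bandwidth):
    """Moving least-squares estimate at x0 using a local linear fit (m = 1)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = math.exp(-(d / bandwidth) ** 2)  # assumed Gaussian weight
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    det = s0 * s2 - s1 * s1
    if det == 0.0:
        raise ValueError("not enough distinct sample points near x0")
    # Solve the 2x2 weighted normal equations for the intercept of
    # a + b * (x - x0); the intercept is the fitted value at x0.
    return (s2 * t0 - s1 * t1) / det


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    ys = [2.0 * x + 1.0 for x in xs]
    print(mls_value(0.75, xs, ys, bandwidth=0.5))
```

A degree-1 fit reproduces linear data exactly, so the query at 0.75 returns 2.5 up to rounding. For smooth non-linear data, the error shrinks with the sample spacing at the rate the abstract gives for degree m.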


Wisdom of the crowd from unsupervised dimension reduction

arXiv.org Machine Learning

Wisdom of the crowd, the collective intelligence derived from responses of multiple human or machine individuals to the same questions, can be more accurate than each individual, and improve social decision-making and prediction accuracy. This can also integrate multiple programs or datasets, each as an individual, for the same predictive questions. Crowd wisdom estimates each individual's independent error level arising from their limited knowledge, and finds the crowd consensus that minimizes the overall error. However, previous studies have merely built isolated, problem-specific models with limited generalizability, and mainly for binary (yes/no) responses. Here we show with simulation and real-world data that the crowd wisdom problem is analogous to one-dimensional unsupervised dimension reduction in machine learning. This provides a natural class of crowd wisdom solutions, such as principal component analysis and Isomap, which can handle binary and also continuous responses, like confidence levels, and consequently can be more accurate than existing solutions. They can even outperform supervised-learning-based collective intelligence that is calibrated on historical performance of individuals, e.g. penalized linear regression and random forest. This study unifies crowd wisdom and unsupervised dimension reduction, and thereupon introduces a broad range of highly-performing and widely-applicable crowd wisdom methods. As the costs for data acquisition and processing rapidly decrease, this study will promote and guide crowd wisdom applications in the social and natural sciences, including data fusion, meta-analysis, crowd-sourcing, and committee decision making.
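The analogy drawn in the abstract above, crowd consensus as one-dimensional unsupervised dimension reduction, can be sketched with a first principal component. Each question is a data point, each individual is a feature, and the projection of a question onto the first component serves as the consensus score. The toy responses, the sign convention (loadings oriented to sum to a positive value, which assumes most individuals are better than chance), and the name `consensus_scores` are assumptions; this is not the paper's exact procedure.

```python
import math


def consensus_scores(responses, iterations=200):
    """Project questions onto the first principal component of the responses.

    responses[q][i] is the answer of individual i to question q.
    """
    n_q = len(responses)
    n_i = len(responses[0])
    means = [sum(row[i] for row in responses) / n_q for i in range(n_i)]
    centered = [[row[i] - means[i] for i in range(n_i)] for row in responses]
    # Covariance between individuals.
    cov = [[sum(r[i] * r[j] for r in centered) / n_q for j in range(n_i)]
           for i in range(n_i)]
    v = [1.0] * n_i
    for _ in range(iterations):  # power iteration for the top eigenvector
        w = [sum(cov[i][j] * v[j] for j in range(n_i)) for i in range(n_i)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("responses have no variance along the start vector")
        v = [x / norm for x in w]
    if sum(v) < 0:  # assumed sign convention: most individuals are informative
        v = [-x for x in v]
    return [sum(r[i] * v[i] for i in range(n_i)) for r in centered]


# Toy data (assumed): six yes/no questions coded +1/-1, three individuals.
# The third individual answers the first question incorrectly.
truth = [1, 1, -1, -1, 1, -1]
responses = [
    [1, 1, -1],
    [1, 1, 1],
    [-1, -1, -1],
    [-1, -1, -1],
    [1, 1, 1],
    [-1, -1, -1],
]
scores = consensus_scores(responses)
consensus = [1 if s > 0 else -1 for s in scores]

if __name__ == "__main__":
    print(consensus)
```

The same projection accepts continuous responses such as confidence levels without modification, which is one of the advantages the abstract claims over methods restricted to binary answers.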