Dimensionality Reduction
Dimensionality reduction with missing values imputation
Gahar, Rania Mkhinini, Arfaoui, Olfa, Hidri, Minyar Sassi, Alouane, Nejib Ben-Hadj
For about thirty years, data analysis methods have largely demonstrated their effectiveness in the processing of data in many fields. Data reduction is one of these methods and part of the descriptive (or exploratory) statistics. It tries to summarize a sample of data using graphs or numerical characteristics. The main interpretation of data reduction is reducing the number of dimensions. This implies that data reduction is part of the multivariate exploratory statistics which seek to reduce the number of data dimensions by extracting a number of factors, dimensions, clusters, etc., which explain the dispersion of (multidimensional) data.
Comparison among dimensionality reduction techniques based on Random Projection for cancer classification
Xie, Haozhe, Li, Jie, Zhang, Qiaosheng, Wang, Yadong
Random Projection (RP) technique has been widely applied in many scenarios because it can reduce high-dimensional features into low-dimensional space within short time and meet the need of real-time analysis of massive data. There is an urgent need of dimensionality reduction with fast increase of big genomics data. However, the performance of RP is usually lower. We attempt to improve classification accuracy of RP through combining other reduction dimension methods such as Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared classification accuracy and running time of different combination methods on three microarray datasets and a simulation dataset. Experimental results show a remarkable improvement of 14.77% in classification accuracy of FS followed by RP compared to RP on BC-TCGA dataset. LDA followed by RP also helps RP to yield a more discriminative subspace with an increase of 13.65% on classification accuracy on the same dataset. FS followed by RP outperforms other combination methods in classification accuracy on most of the datasets.
The Informativeness of $k$-Means and Dimensionality Reduction for Learning Mixture Models
Liu, Zhaoqiang, Tan, Vincent Y. F.
The learning of mixture models can be viewed as a clustering problem. Indeed, given data samples independently generated from a mixture of distributions, we often would like to find the correct target clustering of the samples according to which component distribution they were generated from. For a clustering problem, practitioners often choose to use the simple k-means algorithm. k-means attempts to find an optimal clustering which minimizes the sum-of-squared distance between each point and its cluster center. In this paper, we provide sufficient conditions for the closeness of any optimal clustering and the correct target clustering assuming that the data samples are generated from a mixture of log-concave distributions. Moreover, we show that under similar or even weaker conditions on the mixture model, any optimal clustering for the samples with reduced dimensionality is also close to the correct target clustering. These results provide intuition for the informativeness of k-means (with and without dimensionality reduction) as an algorithm for learning mixture models. We verify the correctness of our theorems using numerical experiments and demonstrate using datasets with reduced dimensionality significant speed ups for the time required to perform clustering.
Linear Dimensionality Reduction in Linear Time: Johnson-Lindenstrauss-type Guarantees for Random Subspace
Randomized dimensionality reduction techniques, such as random projection (RP) [7, 15] and Ho's random subspace method (RS) [12] are popular approaches for data compression, with many empirical studies showing the utility of both for machine learning and data mining tasks in practice [26, 11, 21, 19, 18, 27]. For RP a key theoretical motivation behind their use is the Johnson-Lindenstrauss lemma (JLL), the usual constructive proof of which also implies an algorithm with high-probability geometry preservation guarantees for projected data. However RP is costly to apply to large or high-dimensional datasets since it requires a matrix-matrix multiplication to implement the projection, and furthermore the projected features may be hard to interpret. On the other hand RS is a particularly appealing approach for dimensionality reduction because it involves simply selecting a subset of data feature indices randomly without replacement, and so does not require a matrix-matrix multiplication to implement the projection and it retains (a subset of) the original features. RS is therefore computationally far more efficient in practice, and more interpretable than RP, but there is little theory to explain its effectiveness. Focusing on this latter problem, here we prove data-dependent norm-preservation guarantees for data projected onto a random subset of the data features.
Seven Techniques for Data Dimensionality Reduction
The recent explosion of data set size, in number of records and attributes, has triggered the development of a number of big data platforms as well as parallel data analytics algorithms. At the same time though, it has pushed for usage of data dimensionality reduction procedures. Indeed, more is not always better. Large amounts of data might sometimes produce worse performances in data analytics applications. One of my most recent projects happened to be about churn prediction and to use the 2009 KDD Challenge large data set.
Non-Redundant Spectral Dimensionality Reduction
Spectral dimensionality reduction algorithms are widely used in numerous domains, including for recognition, segmentation, tracking and visualization. However, despite their popularity, these algorithms suffer from a major limitation known as the "repeated Eigen-directions" phenomenon. That is, many of the embedding coordinates they produce typically capture the same direction along the data manifold. This leads to redundant and inefficient representations that do not reveal the true intrinsic dimensionality of the data. In this paper, we propose a general method for avoiding redundancy in spectral algorithms. Our approach relies on replacing the orthogonality constraints underlying those methods by unpredictability constraints. Specifically, we require that each embedding coordinate be unpredictable (in the statistical sense) from all previous ones. We prove that these constraints necessarily prevent redundancy, and provide a simple technique to incorporate them into existing methods. As we illustrate on challenging high-dimensional scenarios, our approach produces significantly more informative and compact representations, which improve visualization and classification tasks.
Joint Semi-supervised RSS Dimensionality Reduction and Fingerprint Based Algorithm for Indoor Localization
Zhou, Caifa, Ma, Lin, Tan, Xuezhi
With the recent development in mobile computing devices and as the ubiquitous deployment of access points(APs) of Wireless Local Area Networks(WLANs), WLAN based indoor localization systems(WILSs) are of mounting concentration and are becoming more and more prevalent for they do not require additional infrastructure. As to the localization methods in WILSs, for the approaches used to localization in satellite based global position systems are difficult to achieve in indoor environments, fingerprint based localization algorithms(FLAs) are predominant in the RSS based schemes. However, the performance of FLAs has close relationship with the number of APs and the number of reference points(RPs) in WILSs, especially as the redundant deployment of APs and RPs in the system. There are two fatal problems, curse of dimensionality (CoD) and asymmetric matching(AM), caused by increasing number of APs and breaking down APs during online stage. In this paper, a semi-supervised RSS dimensionality reduction algorithm is proposed to solve these two dilemmas at the same time and there are numerous analyses about the theoretical realization of the proposed method. Another significant innovation of this paper is jointing the fingerprint based algorithm with CM-SDE algorithm to improve the localization accuracy of indoor localization.
Data Science with Python & R: Dimensionality Reduction and Clustering
An important step in data analysis is data exploration and representation. In this tutorial we will see how by combining a technique called Principal Component Analysis (PCA) together with Cluster Analysis we can represent in a two-dimensional space data defined in a higher dimensional one while, at the same time, being able to group this data in similar groups or clusters and find hidden relationships in our data. More concretely, PCA reduces data dimensionality by finding principal components. These are the directions of maximum variation in a dataset. By reducing a dataset original features or variables to a reduced set of new ones based on the principal components, we end up with the minimum number of variables that keep the maximum amount of variation or information about how the data is distributed. If we end up with just two of these new variables, we will be able to represent each sample in our data in a two-dimensional chart (e.g. a scatterplot). As an unsupervised data analysis technique, clustering organises data samples by proximity based on its variables.
Dimensionality Reduction of Massive Sparse Datasets Using Coresets
Feldman, Dan, Volkov, Mikhail, Rus, Daniela
In this paper we present a practical solution with performance guarantees to the problem of dimensionality reduction for very large scale sparse matrices. We show applications of our approach to computing the Principle Component Analysis (PCA) of any $n\times d$ matrix, using one pass over the stream of its rows. Our solution uses coresets: a scaled subset of the $n$ rows that approximates their sum of squared distances to \emph{every} $k$-dimensional \emph{affine} subspace. An open theoretical problem has been to compute such a coreset that is independent of both $n$ and $d$. An open practical problem has been to compute a non-trivial approximation to the PCA of very large but sparse databases such as the Wikipedia document-term matrix in a reasonable time. We answer both of these questions affirmatively. Our main technical result is a new framework for deterministic coreset constructions based on a reduction to the problem of counting items in a stream.
Large Margin Discriminant Dimensionality Reduction in Prediction Space
Saberian, Mohammad, Pereira, Jose Costa, Xu, Can, Yang, Jian, Nvasconcelos, Nuno
In this paper we establish a duality between boosting and SVM, and use this to derive a novel discriminant dimensionality reduction algorithm. In particular, using the multiclass formulation of boosting and SVM we note that both use a combination of mapping and linear classification to maximize the multiclass margin. In SVM this is implemented using a predefined mapping (induced by the kernel) and optimizing the linear classifiers. In boosting the linear classifiers are predefined and the mapping (predictor) is learned through a combination of weak learners. We argue that the intermediate mapping, i.e. boosting predictor, is preserving the discriminant aspects of the data and that by controlling the dimension of this mapping it is possible to obtain discriminant low dimensional representations for the data. We use the aforementioned duality and propose a new method, Large Margin Discriminant Dimensionality Reduction (LADDER) that jointly learns the mapping and the linear classifiers in an efficient manner. This leads to a data-driven mapping which can embed data into any number of dimensions. Experimental results show that this embedding can significantly improve performance on tasks such as hashing and image/scene classification.