Seven Techniques for Data Dimensionality Reduction


The recent explosion of data set size, in number of records and attributes, has triggered the development of a number of big data platforms as well as parallel data analytics algorithms. At the same time though, it has pushed for usage of data dimensionality reduction procedures. Indeed, more is not always better. Large amounts of data might sometimes produce worse performances in data analytics applications. One of my most recent projects happened to be about churn prediction and to use the 2009 KDD Challenge large data set.

Sparse Embedded $k$-Means Clustering

Neural Information Processing Systems

The $k$-means clustering algorithm is a ubiquitous tool in data mining and machine learning that shows promising performance. However, its high computational cost has hindered its applications in broad domains. Researchers have successfully addressed these obstacles with dimensionality reduction methods. Recently, [1] develop a state-of-the-art random projection (RP) method for faster $k$-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to the advanced singular value decomposition based feature extraction approach, [1] reduce the running time by a factor of $\min \{n,d\}\epsilon^2 log(d)/k$ for data matrix $X \in \mathbb{R}^{n\times d} $ with $n$ data points and $d$ features, while losing only a factor of one in approximation accuracy. Unfortunately, they still require $\mathcal{O}(\frac{ndk}{\epsilon^2log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of $n$ and $d$. To break this bottleneck, we carefully build a sparse embedded $k$-means clustering algorithm which requires $\mathcal{O}(nnz(X))$ ($nnz(X)$ denotes the number of non-zeros in $X$) for fast matrix multiplication. Moreover, our proposed algorithm improves on [1]'s results for approximation accuracy by a factor of one. Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate $k$-means clustering, while achieving satisfactory clustering performance.

Non-I.I.D. Multi-Instance Dimensionality Reduction by Learning a Maximum Bag Margin Subspace

AAAI Conferences

Multi-instance learning, as other machine learning tasks, also suffers from the curse of dimensionality. Although dimensionality reduction methods have been investigated for many years, multi-instance dimensionality reduction methods remain untouched. On the other hand, most algorithms in multi- instance framework treat instances in each bag as independently and identically distributed samples, which fails to utilize the structure information conveyed by instances in a bag. In this paper, we propose a multi-instance dimensionality reduction method, which treats instances in each bag as non-i.i.d. samples. We regard every bag as a whole entity and define a bag margin objective function. By maximizing the margin of positive and negative bags, we learn a subspace to obtain more salient representation of original data. Experiments demonstrate the effectiveness of the proposed method.

IT-map: an Effective Nonlinear Dimensionality Reduction Method for Interactive Clustering Machine Learning

Scientists in many fields have the common and basic need of dimensionality reduction: visualizing the underlying structure of the massive multivariate data in a low-dimensional space. However, many dimensionality reduction methods confront the so-called "crowding problem" that clusters tend to overlap with each other in the embedding. Previously, researchers expect to avoid that problem and seek to make clusters maximally separated in the embedding. However, the proposed in-tree (IT) based method, called IT-map, allows clusters in the embedding to be locally overlapped, while seeking to make them distinguishable by some small yet key parts. IT-map provides a simple, effective and novel solution to cluster-preserving mapping, which makes it possible to cluster the original data points interactively and thus should be of general meaning in science and engineering.

Quantitative Comparison of Linear and Non-linear Dimensionality Reduction Techniques for Solar Image Archives

AAAI Conferences

This work investigates the applicability of several dimensionality reduction techniques for large scale solar data analysis. Using the first solar domain-specific benchmark dataset that contains images of multiple types of phenomena, we investigate linear and non-linear dimensionality reduction methods in order to reduce our storage costs and maintain an accurate representation of our data in a new vector space. We present a comparative analysis between several dimensionality reduction methods and different numbers of target dimensions by utilizing different classifiers in order to determine the percentage of dimensionality reduction that can be achieved on solar data with said methods, and to discover the method that is the most effective for solar images.