Seven Techniques for Data Dimensionality Reduction


The recent explosion in data set size, in both number of records and number of attributes, has triggered the development of a number of big data platforms as well as parallel data analytics algorithms. At the same time, it has pushed for the use of data dimensionality reduction procedures. Indeed, more is not always better: large amounts of data can sometimes lead to worse performance in data analytics applications. One of my most recent projects was about churn prediction and used the large data set from the 2009 KDD Challenge.

Sparse Embedded $k$-Means Clustering

Neural Information Processing Systems

The $k$-means clustering algorithm is a ubiquitous tool in data mining and machine learning that shows promising performance. However, its high computational cost has hindered its applications in broad domains. Researchers have successfully addressed these obstacles with dimensionality reduction methods. Recently, [1] develop a state-of-the-art random projection (RP) method for faster $k$-means clustering. Their method delivers many improvements over other dimensionality reduction methods. For example, compared to the advanced singular value decomposition based feature extraction approach, [1] reduce the running time by a factor of $\min\{n,d\}\epsilon^2 \log(d)/k$ for data matrix $X \in \mathbb{R}^{n\times d}$ with $n$ data points and $d$ features, while losing only a factor of one in approximation accuracy. Unfortunately, they still require $\mathcal{O}(\frac{ndk}{\epsilon^2 \log(d)})$ for matrix multiplication and this cost will be prohibitive for large values of $n$ and $d$. To break this bottleneck, we carefully build a sparse embedded $k$-means clustering algorithm which requires $\mathcal{O}(\mathrm{nnz}(X))$ ($\mathrm{nnz}(X)$ denotes the number of non-zeros in $X$) for fast matrix multiplication. Moreover, our proposed algorithm improves on [1]'s results for approximation accuracy by a factor of one. Our empirical studies corroborate our theoretical findings, and demonstrate that our approach is able to significantly accelerate $k$-means clustering, while achieving satisfactory clustering performance.
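The idea of embedding the data with a sparse matrix before clustering can be sketched as follows. This is a minimal illustration and not the authors' algorithm: it assumes a CountSketch-style embedding (each feature is hashed to one output column with a random sign), and the names `sparse_embed` and `kmeans` are hypothetical helpers written for this example. The input here is a dense NumPy array for simplicity; reaching the $\mathcal{O}(\mathrm{nnz}(X))$ cost mentioned above would additionally require storing $X$ in a sparse format.

```python
import numpy as np


def sparse_embed(X, m, seed=0):
    """Map the d features of X to m features with a CountSketch-style matrix.

    Assumption for illustration: each original feature is assigned to exactly
    one output column (its "bucket") and multiplied by a random sign.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    buckets = rng.integers(0, m, size=d)      # target column of each feature
    signs = rng.choice([-1.0, 1.0], size=d)   # random sign of each feature
    Zt = np.zeros((m, n))
    # Add every signed feature column into its bucket; repeated buckets accumulate.
    np.add.at(Zt, buckets, (X * signs).T)
    return Zt.T


def kmeans(Z, k, iters=20, seed=0):
    """Plain Lloyd iterations on the embedded data; returns cluster labels."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(Z.shape[0], size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = Z[labels == j]
            if len(members):                  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Two synthetic groups of points in 200 dimensions (made-up data).
    X = np.vstack([rng.normal(0.0, 1.0, size=(50, 200)),
                   rng.normal(5.0, 1.0, size=(50, 200))])
    Z = sparse_embed(X, m=20)
    labels = kmeans(Z, k=2)
    print(Z.shape, np.bincount(labels, minlength=2))
```

Because each feature touches only one output column, the embedding needs one pass over the entries of $X$ rather than a dense matrix product, which is the intuition behind replacing a dense random projection with a sparse one.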

Spectral Overlap and a Comparison of Parameter-Free, Dimensionality Reduction Quality Metrics

Machine Learning

Nonlinear dimensionality reduction methods are a popular tool for data scientists and researchers to visualize complex, high dimensional data. However, while these methods continue to improve and grow in number, it is often difficult to evaluate the quality of a visualization due to a variety of factors such as lack of information about the intrinsic dimension of the data and additional tuning required for many evaluation metrics. In this paper, we seek to provide a systematic comparison of dimensionality reduction quality metrics using datasets where we know the ground truth manifold. We utilize each metric for hyperparameter optimization in popular dimensionality reduction methods used for visualization and provide quantitative metrics to objectively compare visualizations to their original manifold. In our results, we find a few methods that appear to consistently do well and propose the best performer as a benchmark for evaluating dimensionality reduction based visualizations.

Non-I.I.D. Multi-Instance Dimensionality Reduction by Learning a Maximum Bag Margin Subspace

AAAI Conferences

Multi-instance learning, like other machine learning tasks, also suffers from the curse of dimensionality. Although dimensionality reduction methods have been investigated for many years, multi-instance dimensionality reduction methods remain untouched. On the other hand, most algorithms in the multi-instance framework treat the instances in each bag as independently and identically distributed samples, which fails to utilize the structural information conveyed by the instances in a bag. In this paper, we propose a multi-instance dimensionality reduction method which treats the instances in each bag as non-i.i.d. samples. We regard every bag as a whole entity and define a bag margin objective function. By maximizing the margin between positive and negative bags, we learn a subspace that yields a more salient representation of the original data. Experiments demonstrate the effectiveness of the proposed method.

IT-map: an Effective Nonlinear Dimensionality Reduction Method for Interactive Clustering

Machine Learning

Scientists in many fields share a common and basic need for dimensionality reduction: visualizing the underlying structure of massive multivariate data in a low-dimensional space. However, many dimensionality reduction methods face the so-called "crowding problem", in which clusters tend to overlap with each other in the embedding. Previously, researchers have tried to avoid that problem by seeking to make clusters maximally separated in the embedding. In contrast, the proposed in-tree (IT) based method, called IT-map, allows clusters in the embedding to overlap locally, while seeking to make them distinguishable by some small yet key parts. IT-map provides a simple, effective and novel solution to cluster-preserving mapping, which makes it possible to cluster the original data points interactively and thus should be of general significance in science and engineering.