Dimensionality reduction is a crucial first step for many unsupervised learning tasks, including anomaly detection. The autoencoder is a popular mechanism for accomplishing it. For dimensionality reduction to be effective on high-dimensional data that embed a nonlinear low-dimensional manifold, it is understood that some form of geodesic distance metric should be used to discriminate between data samples. Inspired by the success of neighborhood-aware, shortest-path-based geodesic approximators such as ISOMAP, in this work we propose to use a minimum spanning tree (MST), a graph-based algorithm, to approximate the local neighborhood structure and generate structure-preserving distances among data points. We use this MST-based distance metric to replace the Euclidean distance metric in the embedding function of autoencoders and develop a new graph-regularized autoencoder. Over 20 benchmark anomaly detection datasets, it outperforms both the plain autoencoder with no regularizer and autoencoders using a Euclidean-based regularizer. We furthermore incorporate the MST regularizer into two generative adversarial networks and find that it substantially improves anomaly detection performance for both.
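To make the idea of an MST-based distance concrete, the following minimal sketch builds a minimum spanning tree over the complete Euclidean graph and measures the distance between two points as the path length along the tree. This is one plausible reading of the abstract, not the authors' implementation: the function names (`euclidean`, `minimum_spanning_tree`, `mst_distances`) are hypothetical, the use of Prim's algorithm is an assumption, and the regularizer itself is not shown.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Straight-line distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, weight) tree edges. Assumes at least one point.
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge connecting each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


def mst_distances(points):
    """All-pairs path lengths along the MST (an assumed 'MST-based distance')."""
    n = len(points)
    adj = defaultdict(list)
    for i, j, w in minimum_spanning_tree(points):
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = [[0.0] * n for _ in range(n)]
    for s in range(n):
        # Depth-first walk; a tree has a unique path between any two nodes.
        stack = [(s, -1, 0.0)]
        while stack:
            node, prev, d = stack.pop()
            dist[s][node] = d
            for nxt, w in adj[node]:
                if nxt != prev:
                    stack.append((nxt, node, d + w))
    return dist


# Points along a U-shaped curve: the two tips are close in Euclidean terms
# but far apart when distance is measured along the curve.
u_shape = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
d = mst_distances(u_shape)
print(euclidean(u_shape[0], u_shape[6]))  # 2.0
print(d[0][6])                            # 6.0
```

In this toy example the tree distance between the two tips follows the curve, whereas the Euclidean distance cuts across the gap, which is the kind of structure-preserving behavior a geodesic-style metric is meant to capture.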
Here, we propose an unsupervised fuzzy rule-based dimensionality reduction method, primarily for data visualization. It addresses the following issues relevant to dimensionality reduction-based data visualization: (i) preservation of neighborhood relationships, (ii) handling data on a non-linear manifold, (iii) the capability of predicting projections for new test data points, (iv) interpretability of the system, and (v) the ability to reject test points if required. For this, we use a first-order Takagi-Sugeno type model. We generate rule antecedents using clusters in the input data. In this context, we also propose a new variant of the Geodesic c-means clustering algorithm. We estimate the rule parameters by minimizing an error function that preserves the inter-point geodesic distances (distances over the manifold) as Euclidean distances in the projected space. We apply the proposed method to three synthetic and three real-world data sets and visually compare the results with those of four other standard data visualization methods. The results show that the proposed method behaves desirably and performs better than, or comparably to, the methods it is compared against. The proposed method is found to be robust to the initial conditions. Its predictability for test points is validated by experiments. We also assess the ability of our method to reject output points when it should. We then extend this concept to provide a general framework for learning an unsupervised fuzzy model for data projection with different objective functions. To the best of our knowledge, this is the first attempt at manifold learning using unsupervised fuzzy modeling.
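As a rough illustration of how a first-order Takagi-Sugeno model can project a point and decline to project one that lies far from every rule, consider the sketch below. It shows inference only, under stated assumptions: Gaussian antecedents centered on cluster centers, a rejection rule based on the largest firing strength, and the names `firing_strength`, `ts_project`, and `reject_below` are all hypothetical choices for this example. How the rule parameters are learned (the geodesic-distance-preserving error function) is not shown.

```python
import math


def firing_strength(x, center, spread):
    """Gaussian membership of x in a rule centered at `center` (assumed form)."""
    sq = sum((a - c) ** 2 for a, c in zip(x, center))
    return math.exp(-sq / (2.0 * spread ** 2))


def ts_project(x, rules, reject_below=1e-3):
    """First-order Takagi-Sugeno inference.

    Each rule is a dict with 'center', 'spread', 'weights' (one row per output
    dimension) and 'bias'. The output is the firing-strength-weighted average
    of the rules' linear consequents. Returns None when no rule fires strongly
    enough, which is one simple (assumed) way to reject a test point.
    """
    strengths = [firing_strength(x, r["center"], r["spread"]) for r in rules]
    if max(strengths) < reject_below:
        return None
    total = sum(strengths)
    out_dim = len(rules[0]["bias"])
    y = [0.0] * out_dim
    for s, r in zip(strengths, rules):
        for k in range(out_dim):
            linear = sum(w * a for w, a in zip(r["weights"][k], x)) + r["bias"][k]
            y[k] += s * linear / total
    return y


# Two illustrative rules mapping 2-D inputs to a 1-D projection.
rules = [
    {"center": [0.0, 0.0], "spread": 1.0, "weights": [[1.0, 0.0]], "bias": [0.0]},
    {"center": [10.0, 0.0], "spread": 1.0, "weights": [[0.0, 0.0]], "bias": [5.0]},
]
print(ts_project([10.0, 0.0], rules))    # dominated by the second rule
print(ts_project([100.0, 100.0], rules)) # far from every rule, so rejected
```

Because each rule pairs a local region of the input space with a linear map, the rule base can be read rule by rule, which is where the interpretability of such a system comes from.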
Visualizing high-dimensional data is an essential task in data science and machine learning. The Centroid-Encoder (CE) method is similar to the autoencoder but incorporates label information to keep objects of a class close together in the reduced visualization space. CE exploits nonlinearity and labels to encode high variance in low dimensions while capturing the global structure of the data. We present a detailed analysis of the method using a wide variety of data sets and compare it with other supervised dimension reduction techniques, including NCA, nonlinear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. We empirically show that the centroid-encoder outperforms most of these techniques. We also show that when the data variance is spread across multiple modalities, the centroid-encoder extracts a significant amount of information from the data in a low-dimensional space. This key feature establishes its value as a tool for data visualization.
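The difference from a plain autoencoder can be sketched through the training target. The example below assumes, as the method's name suggests, that the network is asked to map each sample to the centroid of its class rather than to reconstruct the sample itself; this is an assumption about the cost, not a statement of the authors' exact formulation. The names `class_centroids` and `centroid_encoder_loss` are hypothetical, and `model` stands in for any callable mapping an input vector to an output vector of the same length.

```python
from collections import defaultdict


def class_centroids(samples, labels):
    """Mean vector of the samples belonging to each class label."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def centroid_encoder_loss(model, samples, labels):
    """Mean squared distance between model(x) and the centroid of x's class.

    A plain autoencoder would compare model(x) with x itself; here the target
    is replaced by the class centroid (assumed form of the cost).
    """
    centroids = class_centroids(samples, labels)
    total = 0.0
    for x, y in zip(samples, labels):
        out = model(x)
        total += sum((o - c) ** 2 for o, c in zip(out, centroids[y]))
    return total / len(samples)


samples = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
labels = ["a", "a", "b", "b"]

identity = lambda x: x  # reproduces the input, like a perfect autoencoder
print(centroid_encoder_loss(identity, samples, labels))  # 1.0
```

A model that reproduces its input exactly would have zero reconstruction error as an autoencoder, yet it still incurs a nonzero cost here, because the cost rewards pulling each class toward a single point.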
High-dimensional data in many machine learning applications lead to computational and analytical complexities. Feature selection provides an effective way of solving these problems by removing irrelevant and redundant features, thus reducing model complexity and improving the accuracy and generalization capability of the model. In this paper, we present a novel teacher-student feature selection (TSFS) method in which a 'teacher' (a deep neural network or a complicated dimension reduction method) is first employed to learn the best representation of the data in low dimensions. A 'student' network (a simple neural network) is then used to perform feature selection by minimizing the reconstruction error of the low-dimensional representation. Although the teacher-student scheme is not new, to the best of our knowledge this is the first time it has been employed for feature selection. The proposed TSFS can be used for both supervised and unsupervised feature selection. The method is evaluated on different datasets and compared with existing state-of-the-art feature selection methods. The results show that TSFS performs better in terms of classification and clustering accuracies and reconstruction error. Moreover, experimental evaluations demonstrate a low degree of sensitivity to parameter selection in the proposed method.
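The teacher-student idea can be illustrated with a deliberately simplified stand-in. In the sketch below the teacher's output is taken as a given one-dimensional embedding, and the "student" is not a neural network but a one-variable least-squares line fitted per feature; features are then ranked by how well they alone reconstruct the teacher's embedding. Both simplifications, and the names `fit_error` and `rank_features`, are assumptions made for this example and do not reproduce the TSFS method.

```python
def fit_error(feature, target):
    """Mean squared residual of the best least-squares line target ~ a*feature + b."""
    n = len(feature)
    mx = sum(feature) / n
    my = sum(target) / n
    sxx = sum((x - mx) ** 2 for x in feature)
    sxy = sum((x - mx) * (y - my) for x, y in zip(feature, target))
    a = sxy / sxx if sxx > 0 else 0.0  # a constant feature carries no signal
    b = my - a * mx
    return sum((y - (a * x + b)) ** 2 for x, y in zip(feature, target)) / n


def rank_features(data, teacher_embedding):
    """Feature indices ordered from most to least able to reconstruct the embedding."""
    n_features = len(data[0])
    errors = []
    for j in range(n_features):
        column = [row[j] for row in data]
        errors.append((fit_error(column, teacher_embedding), j))
    return [j for _, j in sorted(errors)]


# Toy data: feature 0 drives the (hypothetical) teacher embedding,
# feature 1 is unrelated to it, and feature 2 is constant.
data = [[0.0, 3.0, 5.0], [1.0, 1.0, 5.0], [2.0, 4.0, 5.0], [3.0, 1.0, 5.0], [4.0, 3.0, 5.0]]
teacher_embedding = [1.0, 3.0, 5.0, 7.0, 9.0]
ranking = rank_features(data, teacher_embedding)
print(ranking[0])  # index of the feature that best reconstructs the embedding
```

The design choice worth noting is that the student is scored against the teacher's low-dimensional representation rather than against the original data, so feature relevance is defined by what the teacher considered worth keeping.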