This is a tutorial and survey paper on unification of spectral dimensionality reduction methods, kernel learning by Semidefinite Programming (SDP), Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE), and its variants. We first explain how the spectral dimensionality reduction methods can be unified as kernel Principal Component Analysis (PCA) with different kernels. This unification can be interpreted as eigenfunction learning or representation of kernel in terms of distance matrix. Then, since the spectral methods are unified as kernel PCA, we say let us learn the best kernel for unfolding the manifold of data to its maximum variance. We first briefly introduce kernel learning by SDP for the transduction task. Then, we explain MVU in detail. Various versions of supervised MVU using nearest neighbors graph, by class-wise unfolding, by Fisher criterion, and by colored MVU are explained. We also explain out-of-sample extension of MVU using eigenfunctions and kernel mapping. Finally, we introduce other variants of MVU including action respecting embedding, relaxed MVU, and landmark MVU for big data.
Manifold learning is used for dimensionality reduction, with the goal of finding a projection subspace to increase and decrease the inter- and intraclass variances, respectively. However, a bottleneck for subspace learning methods often arises from the high dimensionality of datasets. In this paper, a hierarchical approach is proposed to scale subspace learning methods, with the goal of improving classification in large datasets by a range of 3% to 10%. Different combinations of methods are studied. We assess the proposed method on five publicly available large datasets, for different eigen-value based subspace learning methods such as linear discriminant analysis, principal component analysis, generalized discriminant analysis, and reconstruction independent component analysis. To further examine the effect of the proposed method on various classification methods, we fed the generated result to linear discriminant analysis, quadratic linear analysis, k-nearest neighbor, and random forest classifiers. The resulting classification accuracies are compared to show the effectiveness of the hierarchical approach, reporting results of an average of 5% increase in classification accuracy.
We present a new method which generalizes subspace learning based on eigenvalue and generalized eigenvalue problems. This method, Roweis Discriminant Analysis (RDA), is named after Sam Roweis to whom the field of subspace learning owes significantly. RDA is a family of infinite number of algorithms where Principal Component Analysis (PCA), Supervised PCA (SPCA), and Fisher Discriminant Analysis (FDA) are special cases. One of the extreme special cases, which we name Double Supervised Discriminant Analysis (DSDA), uses the labels twice; it is novel and has not appeared elsewhere. We propose a dual for RDA for some special cases. We also propose kernel RDA, generalizing kernel PCA, kernel SPCA, and kernel FDA, using both dual RDA and representation theory. Our theoretical analysis explains previously known facts such as why SPCA can use regression but FDA cannot, why PCA and SPCA have duals but FDA does not, why kernel PCA and kernel SPCA use kernel trick but kernel FDA does not, and why PCA is the best linear method for reconstruction. Roweisfaces and kernel Roweisfaces are also proposed generalizing eigenfaces, Fisherfaces, supervised eigenfaces, and their kernel variants. We also report experiments showing the effectiveness of RDA and kernel RDA on some benchmark datasets.
High-dimensional data in many machine learning applications leads to computational and analytical complexities. Feature selection provides an effective way for solving these problems by removing irrelevant and redundant features, thus reducing model complexity and improving accuracy and generalization capability of the model. In this paper, we present a novel teacher-student feature selection (TSFS) method in which a 'teacher' (a deep neural network or a complicated dimension reduction method) is first employed to learn the best representation of data in low dimension. Then a 'student' network (a simple neural network) is used to perform feature selection by minimizing the reconstruction error of low dimensional representation. Although the teacher-student scheme is not new, to the best of our knowledge, it is the first time that this scheme is employed for feature selection. The proposed TSFS can be used for both supervised and unsupervised feature selection. This method is evaluated on different datasets and is compared with state-of-the-art existing feature selection methods. The results show that TSFS performs better in terms of classification and clustering accuracies and reconstruction error. Moreover, experimental evaluations demonstrate a low degree of sensitivity to parameter selection in the proposed method.
Applications in various research fields have developed different domain-specific methods for feature learning and subsequent supervised model training [24, 26, 28]. Many exploratory applications in practice are further characterized by high-dimensional feature representations where the dimensionality reduction problem is to be addressed. One traditional approach towards supervised dimensionality reduction is feature selection, referring to the process of selecting the most class-informative subset from the high-dimensional feature set and discarding others . Particularly, feature selection based on information theoretic criteria (e.g., maximum mutual information) have shown significant promise in earlier studies [2, 25]. Although selecting a class-relevant subset of features leads to intuitively interpretable and preferable learning algorithms, feature ranking and selection algorithms are known to potentially yield sub-optimal solutions due to their inability to thoroughly assess feature dependencies [10, 44]. In that regard, feature transformation based dimensionality reduction methods provide a more robust alternative , which have been also studied in the form of information theoretic projections or rotations [11, 19, 43].