Boucher, Thomas (University of Massachusetts (Amherst)) | Carey, CJ (University of Massachusetts (Amherst)) | Mahadevan, Sridhar (University of Massachusetts (Amherst)) | Dyar, Melinda Darby (Mount Holyoke College)
Current manifold alignment methods can effectively align data sets that are drawn from a non-intersecting set of manifolds. However, as data sets become increasingly high-dimensional and complex, this assumption may not hold. This paper proposes a novel manifold alignment algorithm, low rank alignment (LRA), that uses a low rank representation (instead of a nearest neighbor graph construction) to embed and align data sets drawn from mixtures of manifolds. LRA does not require the tuning of a sensitive nearest neighbor hyperparameter or prior knowledge of the number of manifolds, both of which are common drawbacks with existing techniques. We demonstrate the effectiveness of our algorithm in two real-world applications: a transfer learning task in spectroscopy and a canonical information retrieval task.
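To illustrate what a low rank representation provides in place of a nearest neighbor graph, the following is a minimal sketch, not the LRA algorithm itself. It assumes the simplest noiseless formulation, min ||Z||_* subject to X = XZ with samples stored as columns, whose minimizer is the projector onto the row space of X; the paper's own objective may differ (for example, by including a regularized reconstruction term). The function name and the toy data are hypothetical.

```python
import numpy as np

def low_rank_representation(X, tol=1e-10):
    """Noiseless low-rank self-representation of the columns of X.

    Assumed objective: min ||Z||_* subject to X = X Z. Its minimizer is
    V_r V_r^T, where V_r holds the right singular vectors of X that
    belong to nonzero singular values.
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))   # numerical rank of X
    V = Vt[:rank].T
    return V @ V.T

rng = np.random.default_rng(0)
# Toy mixture: two independent 2-D subspaces in R^10, 15 samples each.
X1 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 15))
X2 = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 15))
X = np.hstack([X1, X2])

Z = low_rank_representation(X)
# Largest weight linking samples from different subspaces.
cross_block = np.abs(Z[:15, 15:]).max()
```

In this toy case each sample is reconstructed only from samples in its own subspace, so the cross-subspace block of Z vanishes without choosing a neighborhood size or specifying the number of subspaces.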
The aim of manifold learning is to extract low-dimensional manifolds from high-dimensional data. Manifold alignment is a variant of manifold learning that uses two or more datasets that are assumed to be different high-dimensional representations of the same underlying manifold. Manifold alignment can succeed in detecting latent manifolds in cases where one version of the data alone is not sufficient to extract and establish a stable low-dimensional representation. The present study proposes a parallel deep autoencoder neural network architecture for manifold alignment and conducts a series of experiments using a protein-folding benchmark dataset and a suite of new datasets generated by simulating double-pendulum dynamics with underlying manifolds of dimensions 2, 3 and 4. The dimensionality and topological complexity of these latent manifolds exceed those considered in most previous studies. Our experimental results demonstrate that the parallel deep autoencoder outperforms the tested traditional methods of semi-supervised manifold alignment in most cases.
In this paper, we propose a Generalized Unsupervised Manifold Alignment (GUMA) method to build connections between different but correlated datasets without any known correspondences. Based on the assumption that datasets of the same theme usually have similar manifold structures, GUMA is formulated as an explicit integer optimization problem that considers the structure matching and preserving criteria, as well as the feature comparability of the corresponding points in the mutual embedding space. The main benefits of this model include: (1) simultaneous discovery and alignment of manifold structures; (2) fully unsupervised matching without any pre-specified correspondences; (3) efficient iterative alignment without computations in all permutation cases. Experimental results on dataset matching and real-world applications demonstrate the effectiveness and practicality of our manifold alignment method.
We propose a novel framework for combining datasets via alignment of their associated intrinsic dimensions. Our approach assumes that two datasets are sampled from a common latent space, i.e., they measure equivalent systems. Thus, we expect there to exist a natural (albeit unknown) alignment of the data manifolds associated with the intrinsic geometry of these datasets, which are perturbed by measurement artifacts in the sampling process. Importantly, we do not assume any individual correspondence (partial or complete) between data points. Instead, we rely on our assumption that a subset of data features have correspondence across datasets. We leverage this assumption to estimate relations between intrinsic manifold dimensions, which are given by diffusion map coordinates over each of the datasets. We compute a correlation matrix between diffusion coordinates of the datasets by considering graph (or manifold) Fourier coefficients of corresponding data features. We then orthogonalize this correlation matrix to form an isometric transformation between the diffusion maps of the datasets. Finally, we apply this transformation to the diffusion coordinates and construct a unified diffusion geometry of the datasets together. We show that this approach successfully corrects misalignment artifacts and enables data integration.
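The orthogonalization step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: orthonormal coordinate matrices stand in for diffusion map coordinates, the coefficients of the shared features are plain inner products with those coordinates, and the correlation matrix is orthogonalized through its SVD (the orthogonal factor of its polar decomposition). All names are hypothetical. The toy data reuse the same points in both datasets only so the recovered transformation can be checked; the approach described in the abstract does not assume point correspondence.

```python
import numpy as np

def orthogonal_alignment(coef_a, coef_b):
    """Closest orthogonal matrix to the correlation of two coefficient sets.

    coef_a, coef_b: (k, m) arrays; row i holds the coefficients of the m
    shared features on the i-th intrinsic coordinate of each dataset.
    """
    corr = coef_a @ coef_b.T           # (k, k) correlation between coordinates
    U, _, Vt = np.linalg.svd(corr)
    return U @ Vt                      # orthogonal factor of the polar decomposition

rng = np.random.default_rng(0)
n, k, m = 200, 4, 12

# Stand-in for diffusion coordinates of dataset A: an orthonormal basis.
coords_a, _ = np.linalg.qr(rng.standard_normal((n, k)))
# Dataset B expresses the same geometry in a rotated coordinate system.
rotation, _ = np.linalg.qr(rng.standard_normal((k, k)))
coords_b = coords_a @ rotation

# m features assumed to correspond across the two datasets.
features = rng.standard_normal((n, m))
coef_a = coords_a.T @ features         # feature coefficients in A's coordinates
coef_b = coords_b.T @ features         # feature coefficients in B's coordinates

T = orthogonal_alignment(coef_a, coef_b)
aligned_b = coords_b @ T.T             # B's coordinates mapped into A's frame
```

Because T is orthogonal, applying it preserves distances within dataset B while bringing its coordinates into the frame of dataset A.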
We present a locality preserving loss (LPL) that improves the alignment between vector space representations (i.e., word or sentence embeddings) while separating (increasing the distance between) uncorrelated representations, compared with the standard method that minimizes only the mean squared error (MSE). The locality preserving loss optimizes the projection by maintaining, in the target domain, the local neighborhood of embeddings found in the source. This reduces the overall size of the dataset required to train the model. We argue that vector space alignment (with MSE and LPL losses) acts as a regularizer in certain language-based classification tasks, leading to better accuracy than the baseline, especially when the training set is small. We validate the effectiveness of LPL on a cross-lingual word alignment task, a natural language inference task, and a multi-lingual inference task.
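One way to picture combining an MSE term with a locality term is sketched below. This is an assumption-laden illustration, not the loss defined in the paper: it uses a linear projection W, finds each source embedding's k nearest neighbors in the source space, and penalizes the distance between the projected embedding and the uniform average of those neighbors' target embeddings. The uniform weighting, the parameter names `k` and `beta`, and the helper functions are all hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X, excluding itself."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)       # a point is never its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def alignment_loss(W, X_src, Y_tgt, k=3, beta=0.5):
    """Return (total, mse, locality) for a linear projection W.

    mse:      mean squared error between projected source and paired target.
    locality: mean squared error between each projected source embedding and
              the average target embedding of its k source-space neighbors
              (uniform neighbor weights are an assumption of this sketch).
    """
    proj = X_src @ W
    mse = np.mean((proj - Y_tgt) ** 2)
    nbrs = knn_indices(X_src, k)               # (n, k) neighbor indices
    nbr_mean = Y_tgt[nbrs].mean(axis=1)        # (n, d_tgt) neighbor averages
    locality = np.mean((proj - nbr_mean) ** 2)
    return mse + beta * locality, mse, locality

rng = np.random.default_rng(0)
X_src = rng.standard_normal((50, 8))           # toy source embeddings
W_true = rng.standard_normal((8, 6))
Y_tgt = X_src @ W_true                         # toy paired target embeddings

total, mse, locality = alignment_loss(W_true, X_src, Y_tgt)
```

With `beta=0` the objective reduces to the MSE-only baseline; the locality term adds supervision from neighboring pairs.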