Manifold alignment has proven useful in many fields of machine learning and data mining. In this paper we summarize our work in this area and introduce a general framework for manifold alignment. This framework generates a family of approaches that align manifolds by simultaneously matching the corresponding instances and preserving the local geometry of each given manifold. Some approaches, such as semi-supervised alignment and manifold projections, can be obtained as special cases. Our framework can also solve multiple manifold alignment problems and can be adapted to handle the situation in which no correspondence information is available. The approaches are described and evaluated both theoretically and experimentally, with results that show useful knowledge transfer from one domain to another. Novel applications of our methods, including identification of topics shared by multiple document collections and biological structure alignment, are discussed in the paper.
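The idea of matching corresponding instances while preserving each manifold's local geometry can be illustrated with a joint graph Laplacian embedding, in the spirit of semi-supervised alignment. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names `knn_laplacian` and `align_manifolds`, the correspondence weight `mu`, the neighbourhood size `k`, and the two toy curves are all hypothetical choices made for illustration.

```python
import numpy as np


def knn_laplacian(points, k=3):
    """Unnormalized Laplacian of a symmetrized k-nearest-neighbour graph."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(dist[i])[1:k + 1]  # index 0 is the point itself
        adj[i, neighbours] = 1.0
    adj = np.maximum(adj, adj.T)  # make the graph undirected
    return np.diag(adj.sum(axis=1)) - adj


def align_manifolds(x, y, pairs, dim=1, mu=10.0, k=3):
    """Embed two data sets in a shared space (illustrative sketch).

    pairs: list of (i, j) meaning x[i] is known to correspond to y[j].
    mu:    assumed weight of the correspondence term versus local geometry.
    """
    n_x, n_y = len(x), len(y)
    corr = np.zeros((n_x, n_y))
    for i, j in pairs:
        corr[i, j] = 1.0

    # Joint Laplacian: within-set geometry on the diagonal blocks,
    # correspondence edges on the off-diagonal blocks.
    joint = np.zeros((n_x + n_y, n_x + n_y))
    joint[:n_x, :n_x] = knn_laplacian(x, k) + mu * np.diag(corr.sum(axis=1))
    joint[n_x:, n_x:] = knn_laplacian(y, k) + mu * np.diag(corr.sum(axis=0))
    joint[:n_x, n_x:] = -mu * corr
    joint[n_x:, :n_x] = -mu * corr.T

    # Eigenvalues are returned in ascending order; skip the constant vector.
    _, vecs = np.linalg.eigh(joint)
    embedding = vecs[:, 1:dim + 1]
    return embedding[:n_x], embedding[n_x:]


# Toy data: the same 1-D parameter t observed as a 2-D and a 3-D curve.
t = np.linspace(0.0, 1.0, 30)
x = np.column_stack([t, t ** 2])
y = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t), t])
pairs = [(i, i) for i in range(0, 30, 3)]  # only every third pair is known

fx, fy = align_manifolds(x, y, pairs)
print(np.corrcoef(fx[:, 0], fy[:, 0])[0, 1])
```

In this toy setting only a third of the correspondences are supplied, and the remaining points are placed by the local-geometry terms alone. The printed correlation indicates how closely the two embedded curves agree.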
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.
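For orientation, the kind of pairwise alignment score that the proposed similarity is defined in analogy to can be sketched with a standard global alignment recurrence. This is explicitly not the transducer-based score described above: it is plain Needleman-Wunsch scoring, and the function name `global_alignment_score` and the match, mismatch, and gap values are assumptions chosen only to show how substitutions and indels both enter a single score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not a substitution model.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            deletion = prev[j] + gap       # a[i-1] aligned to a gap
            insertion = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

A score of this kind depends only on the two sequences being compared, which is the property that lets a pairwise approach avoid a multiple sequence alignment.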
Structural alignment involves finding equivalences between sequential positions in two proteins. As such, it is similar to sequence alignment. However, in structural alignment the equivalences are not found by comparing two strings of characters but rather by optimally superimposing two structures and finding the regions of closest overlap in three dimensions (figure 1). Structural alignment is becoming increasingly important as the number of known protein structures increases exponentially. Currently, there are more than 5000 structures in the Protein Data Bank (5208 as of September 1995, to be exact). Structural alignment is also very important because it is usually thought of as providing a standard or target for sequence alignment. That is, one will be a long way towards achieving accurate sequence alignment if one can align two homologous but highly diverged proteins (say, with a low percent identity of ~15%) on the basis of sequence as well as on the basis of structure.
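The superposition step can be illustrated with the Kabsch algorithm, which finds the rigid rotation that minimizes the root-mean-square deviation (RMSD) between two coordinate sets. This is a minimal sketch that assumes the residue equivalences are already known, so row i of one array pairs with row i of the other; a full structural alignment must also search for those equivalences. The function name `kabsch_rmsd` and the synthetic coordinates are hypothetical.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two (n, 3) coordinate sets after optimally rotating p onto q.

    Assumes row i of p is equivalent to row i of q.
    """
    # Remove translation by centring both sets on their centroids.
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)

    # Covariance matrix and its singular value decomposition.
    u, _, vt = np.linalg.svd(p_c.T @ q_c)

    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T

    p_rot = p_c @ rotation.T
    return float(np.sqrt(np.mean(np.sum((p_rot - q_c) ** 2, axis=1))))


# Synthetic check: q is p rotated by 40 degrees about z and then shifted.
rng = np.random.default_rng(0)
p = rng.normal(size=(10, 3))
angle = np.deg2rad(40.0)
rz = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
q = p @ rz.T + np.array([1.0, -2.0, 0.5])

print(kabsch_rmsd(p, q))
```

Because `q` is an exact rigid copy of `p`, the RMSD after superposition should be zero up to floating-point error. Real structure pairs leave a nonzero residual, and the regions of closest overlap are those with the smallest per-residue deviations.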
When ontologies cover overlapping topics, the overlap can be represented using ontology alignments. These alignments need to be continuously adapted to changing ontologies. Especially for large ontologies, this is a costly task that often involves manual work. Finding changes that do not lead to an adaptation of the alignment can potentially make this process significantly easier. This work presents an approach to finding these changes based on RDF embeddings and common classification techniques. To examine the feasibility of this approach, an evaluation on a real-world dataset is presented. In this evaluation, the best classifiers reached a precision of 0.8.
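To make the reported metric concrete, precision is the fraction of predicted positives that are truly positive. The sketch below uses made-up labels purely to show the arithmetic; it does not reproduce the evaluation data, and the function name `precision` and the choice of positive class (here, "this change does not require adapting the alignment") are assumptions.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of items predicted as `positive` that are actually positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    predicted_pos = true_pos + false_pos
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical labels: 1 = change needs no alignment update, 0 = it does.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]

print(precision(y_true, y_pred))
```

In these toy labels, five changes are predicted as safe to skip and four of them truly are, so one change in five would be wrongly skipped. That is the practical reading of a precision of 0.8 in this setting.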