Cross-Domain Matching for Bag-of-Words Data via Kernel Embeddings of Latent Distributions

Yoshikawa, Yuya, Iwata, Tomoharu, Sawada, Hiroshi, Yamada, Takeshi

Neural Information Processing Systems 

We propose a kernel-based method for finding matches between instances across different domains, such as multilingual documents and images with annotations. Each instance is assumed to be represented as a multiset of features, e.g., a bag-of-words representation for documents. The major difficulty in finding cross-domain relationships is that the similarity between instances in different domains cannot be measured directly. To overcome this difficulty, the proposed method embeds all the features of the different domains in a shared latent space and regards each instance as a distribution of its own features in that space. To represent the distributions efficiently and nonparametrically, we employ the framework of kernel embeddings of distributions.
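
The idea of comparing instances through their kernel embeddings can be illustrated with a minimal sketch. It is not the authors' implementation: the latent vectors below are set by hand rather than learned, the Gaussian kernel and its `gamma` parameter are assumptions, and all names (`gaussian_gram`, `embedding_distance2`, `latent_a`, `latent_b`) are hypothetical. Each bag of words is mapped to the latent vectors of its features, and two instances from different domains are compared by the squared distance between their empirical kernel mean embeddings.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    """Gram matrix with entries exp(-gamma/2 * ||x - y||^2) (assumed kernel)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * gamma * sq_dists)

def embedding_distance2(X, Y, gamma=1.0):
    """Squared distance between the empirical kernel mean embeddings of X and Y."""
    return (gaussian_gram(X, X, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean()
            - 2.0 * gaussian_gram(X, Y, gamma).mean())

# Hypothetical latent vectors, one row per vocabulary feature in each domain.
# In the method these would be estimated from data; here they are hand-set.
latent_a = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
latent_b = np.array([[0.1, 0.0], [0.9, 0.1], [5.1, 4.9], [4.8, 5.2]])

# Bags of words as lists of feature ids; repeated ids encode multiplicity.
doc_a = [0, 0, 1]                # one instance from domain A
docs_b = [[2, 3, 3], [0, 1, 1]]  # candidate instances from domain B

# Each instance becomes a set of points in the shared latent space.
points_a = latent_a[doc_a]
distances = [embedding_distance2(points_a, latent_b[doc]) for doc in docs_b]

# The candidate with the smallest distance is taken as the match.
best_match = int(np.argmin(distances))
print(distances, best_match)
```

Because both domains share one latent space, the distance is defined even though the two vocabularies have no features in common. With these hand-set vectors, the second candidate lies close to `doc_a` in the latent space and is selected as its match.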