Representation learning is a key element of state-of-the-art deep learning approaches. It enables to transform raw data into structured vector space embeddings. Such embeddings are able to capture the distributional semantics of their context, e.g. by word windows on natural language sentences, graph walks on knowledge graphs or convolutions on images. So far, this context is manually defined, resulting in heuristics which are solely optimized for computational performance on certain tasks like link-prediction. However, such heuristic models of context are fundamentally different to how humans capture information. For instance, when reading a multi-modal webpage (i) humans do not perceive all parts of a document equally: Some words and parts of images are skipped, others are revisited several times which makes the perception trace highly non-sequential; (ii) humans construct meaning from a document's content by shifting their attention between text and image, among other things, guided by layout and design elements. In this paper we empirically investigate the difference between human perception and context heuristics of basic embedding models. We conduct eye tracking experiments to capture the underlying characteristics of human perception of media documents containing a mixture of text and images. Based on that, we devise a prototypical computational perception-trace model, called CMPM. We evaluate empirically how CMPM can improve a basic skip-gram embedding approach. Our results suggest, that even with a basic human-inspired computational perception model, there is a huge potential for improving embeddings since such a model does inherently capture multiple modalities, as well as layout and design elements.
This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefit of multimodal grounding for a variety of language processing tasks and the challenges that arise. We particularly focus on multimodal grounding of verbs which play a crucial role for the compositional power of language.
Language and vision provide complementary information. Integrating both modalities in a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple and effective method that learns a language-to-vision mapping and uses its output visual predictions to build multimodal representations. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently re-constructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped (or imagined) vectors not only help to fuse multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more human-like judgments. Ultimately, the present work sheds light on fundamental questions of natural language understanding concerning the fusion of vision and language such as the plausibility of more associative and re-constructive approaches.
Integrating visual and linguistic information into a single multimodal representation is an unsolved problem with wide-reaching applications to both natural language processing and computer vision. In this paper, we present a simple method to build multimodal representations by learning a language-to-vision mapping and using its output to build multimodal embeddings. In this sense, our method provides a cognitively plausible way of building representations, consistent with the inherently re-constructive and associative nature of human memory. Using seven benchmark concept similarity tests we show that the mapped vectors not only implicitly encode multimodal information, but also outperform strong unimodal baselines and state-of-the-art multimodal methods, thus exhibiting more "human-like" judgments---particularly in zero-shot settings.
Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality.