The problem of consciousness has captured the imagination of philosophers, neuroscientists, and the general public, but has received little attention within AI. However, concepts from robotics and computer vision hold great promise for explaining the major aspects of the phenomenon of consciousness, including philosophically problematic aspects such as the vividness of qualia, the first-person character of conscious experience, and the property of intentionality. This paper presents such an account and evaluates it against eleven features of consciousness "that any philosophical-scientific theory should hope to explain", according to the philosopher and prominent AI critic John Searle.
This paper presents a multimodal learning system that can ground spoken names of objects in their physical referents and simultaneously learn to recognize those objects from naturally co-occurring multisensory input. There are two technical problems involved: (1) the correspondence problem in symbol grounding: how to associate words (symbols) with their perceptually grounded meanings from multiple co-occurrences between words and objects in the physical environment.
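The correspondence problem described above can be illustrated with a minimal sketch of co-occurrence counting across situations. This is not the system presented in the paper: the function name `learn_word_object_map`, the toy utterances, and the symbolic object labels are all hypothetical, and a real system would work from speech and visual features rather than clean symbols.

```python
from collections import Counter, defaultdict


def learn_word_object_map(situations):
    """Map each word to the object it co-occurs with most often.

    `situations` is an iterable of (words, objects) pairs, where each pair
    lists the words heard and the objects visible at the same moment.
    Hypothetical helper for illustration only.
    """
    cooc = defaultdict(Counter)  # cooc[word][obj] = number of joint occurrences
    for words, objects in situations:
        for word in set(words):
            for obj in set(objects):
                cooc[word][obj] += 1

    lexicon = {}
    for word, counts in cooc.items():
        # Highest co-occurrence count wins; ties are broken by object label
        # so that the result is deterministic.
        lexicon[word] = max(counts, key=lambda obj: (counts[obj], obj))
    return lexicon


# Toy data: each utterance is ambiguous on its own, because two objects
# are in view, but the ambiguity resolves across situations.
situations = [
    (["look", "the", "ball"], ["BALL", "TABLE"]),
    (["the", "cup"], ["CUP", "TABLE"]),
    (["ball", "rolls"], ["BALL", "FLOOR"]),
    (["a", "cup", "here"], ["CUP", "FLOOR"]),
]
lexicon = learn_word_object_map(situations)
print(lexicon["ball"], lexicon["cup"])
```

No single situation determines the referent of "ball" or "cup", but the counts accumulated over several situations do. Function words such as "the" also receive a mapping under this naive scheme, which is one reason raw co-occurrence counting is only a starting point.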
This paper examines to what degree current deep learning architectures for image caption generation capture spatial language. On the basis of an evaluation of examples of generated captions from the literature, we argue that systems capture what objects are in the image data but not where these objects are located: the captions generated by these systems are the output of a language model conditioned on the output of an object detector that cannot capture fine-grained location information. Although language models provide useful knowledge for image captions, we argue that deep learning image captioning architectures should also model geometric relations between objects.
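To make concrete what "geometric relations between objects" could mean, the following sketch derives a coarse spatial term from two bounding boxes. It is an illustrative heuristic under stated assumptions, not a method proposed in the paper: the box format `(x_min, y_min, x_max, y_max)` in image coordinates with y increasing downward, the function names, and the four-way choice of terms are all assumptions made for this example.

```python
def center(box):
    """Return the (x, y) centre of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def spatial_relation(target, landmark):
    """Describe where `target` lies relative to `landmark`.

    Uses only the offset between box centres, and assumes image
    coordinates in which y grows downward. Hypothetical helper.
    """
    tx, ty = center(target)
    lx, ly = center(landmark)
    dx, dy = tx - lx, ty - ly
    # Pick the axis along which the two centres are further apart.
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


# Toy detections (hypothetical pixel coordinates).
cup = (10, 40, 30, 60)
laptop = (50, 35, 110, 75)
lamp = (60, 0, 80, 20)

print("cup is", spatial_relation(cup, laptop), "laptop")
print("lamp is", spatial_relation(lamp, laptop), "laptop")
```

A detector that outputs only object labels discards exactly the coordinates this function relies on, which is the gap the abstract points to. Centre offsets are a crude stand-in for the fine-grained location information a captioning model would need.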
This survey discusses how recent developments in multimodal processing facilitate conceptual grounding of language. We categorize the information flow in multimodal processing with respect to cognitive models of human information processing and analyze different methods for combining multimodal representations. Based on this methodological inventory, we discuss the benefits of multimodal grounding for a variety of language processing tasks and the challenges that arise. We focus in particular on multimodal grounding of verbs, which play a crucial role in the compositional power of language.