The research presented in this paper is primarily concerned with the use of complementary textual resources in video and image analysis to support a higher level of automatic (semantic) annotation and indexing of images and videos. While past projects (such as the MUMIS project, described in more detail below) used Information Extraction as the main means for extracting the relevant entities, relations, and events from text for indexing images and videos in a specific domain, today we can build on Semantic Web technologies and resources to detect instances of semantic classes and relations in textual documents, and use those for supporting the semantic annotation and indexing of audiovisual content.
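The detection of semantic-class instances in text can be sketched very simply. The following is a minimal, illustrative example, assuming a toy gazetteer that maps surface forms to made-up semantic class names; real systems would draw these mappings from actual Semantic Web resources.

```python
# Minimal sketch: detecting instances of semantic classes in transcript
# text with a small gazetteer. Entries and class names ("ex:...") are
# illustrative assumptions, not drawn from any specific resource.

GAZETTEER = {
    "ronaldo": "ex:FootballPlayer",
    "real madrid": "ex:FootballTeam",
    "penalty kick": "ex:MatchEvent",
}

def annotate(text):
    """Return sorted (surface form, semantic class) pairs found in text."""
    lowered = text.lower()
    return sorted(
        (surface, sem_class)
        for surface, sem_class in GAZETTEER.items()
        if surface in lowered
    )

annotations = annotate("Ronaldo scored a penalty kick for Real Madrid.")
```

Each detected pair can then be attached to the video segment from which the transcript fragment originated, yielding a semantic index over the audiovisual content.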
The broad challenge, in my view, is to exploit multilingual, multimedia information from both web and TV video to enable a broader understanding of the different ideological, social, and cultural perspectives in different sources, for a wide variety of applications. This will require judicious analysis of text and video features using a variety of machine learning and language analysis methods, an understanding of the video's editing structure, and attention to the context in which the media appears. Other challenges involve coping with the flood of data through mechanisms for intelligent, context-driven summarization, since our brains remain limited in the amount of information they can process. Yet other challenges concern the mobile use of multimedia data, given limited bandwidth, small displays, and limited multimedia input mechanisms, as well as the social network that provides the context of the media; the latter challenge, however, will not be addressed in the following paragraphs. At this point, the infrastructure for collecting massive amounts of multimedia and sensor data exists, and processing hardware is becoming relatively affordable.
Starting from the observation that certain communities have incentive mechanisms in place to create large amounts of unstructured content, we propose in this paper an original model that we expect to yield the large number of annotations required to semantically enrich Web content at large scale. The novelty of our model lies in the combination of two key ingredients: the effort that online communities invest in creating content, and the capability of machines to detect regular patterns in user annotations and suggest new ones. Provided that the creation of semantic content is made easy enough and incentives are in place, we can assume that these communities will be willing to provide annotations. However, as human resources are clearly limited, we aim to integrate algorithmic support into our model that bootstraps on existing annotations and learns patterns for suggesting new annotations. Since the automatically extracted information needs to be validated, our model presents the extracted knowledge to the user in the form of questions, thus allowing the information to be confirmed or rejected. In this paper, we describe the requirements on our model and its concrete implementation based on Semantic MediaWiki and an information extraction system, and we discuss lessons learned from practical experience with real users. These experiences allow us to conclude that our model is a promising approach towards leveraging semantic annotation.
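The suggest-and-validate loop can be illustrated with a small sketch. The page structure, property names, and the simple majority-value heuristic below are assumptions made for illustration; they do not describe the actual Semantic MediaWiki implementation.

```python
from collections import Counter

# Illustrative sketch of the suggest-and-validate loop: learn a regular
# pattern from existing annotations, then phrase a suggested annotation
# as a yes/no validation question. All data here is made up.

def learn_default(annotated_pages, prop):
    """Return the most frequent value of `prop` across annotated pages."""
    values = Counter(p[prop] for p in annotated_pages if prop in p)
    return values.most_common(1)[0][0] if values else None

def suggest(page, prop, annotated_pages):
    """Suggest a value for an unannotated page as a validation question."""
    if prop in page:
        return None  # already annotated, nothing to ask
    value = learn_default(annotated_pages, prop)
    if value is None:
        return None  # no pattern learned yet
    return f"Should '{page['title']}' have {prop} = '{value}'?"

annotated = [
    {"title": "Berlin", "country": "Germany"},
    {"title": "Munich", "country": "Germany"},
    {"title": "Vienna", "country": "Austria"},
]
question = suggest({"title": "Hamburg"}, "country", annotated)
```

Only if the user answers "yes" would the annotation be stored, keeping the human in the loop as the validator of machine-extracted knowledge.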
What are the critical technical challenges in multimedia information extraction (MMIE)? In the following, I focus on one particular challenge: the integration of affective computing and multimedia information extraction. Work by Picard and others has created considerable awareness of the role of affect in human-computer interaction. As the key ingredients of affective computing, Picard identifies recognizing, expressing, modelling, communicating, and responding to emotional information (Picard 2003). In the context of information extraction, methods of affective computing can be applied to enhance classical information extraction tasks with emotion and sentiment detection.
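As a concrete (and deliberately simplistic) illustration of enhancing extraction with sentiment, consider attaching a polarity label to each sentence that mentions an extracted entity. The word lists below are toy assumptions; real affective computing systems use far richer lexical and acoustic models.

```python
# Toy lexicon-based sketch: attach a sentiment label to each mention
# of an entity in extracted sentences. Word lists are assumptions.

POSITIVE = {"brilliant", "great", "superb"}
NEGATIVE = {"poor", "terrible", "disappointing"}

def sentiment(sentence):
    """Score a sentence: each positive word counts +1, each negative -1."""
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def extract_with_sentiment(sentences, entity):
    """Pair each sentence mentioning `entity` with its polarity label."""
    return [(s, sentiment(s)) for s in sentences if entity.lower() in s.lower()]

mentions = extract_with_sentiment(
    ["Zidane was brilliant tonight.", "Zidane had a poor first half."],
    "Zidane",
)
```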
What are the critical technical challenges in multimedia information extraction (MMIE)? There are several challenges, on several fronts. Some of these include:
- Detecting events of interest in video where there is no accompanying sound or text, as in surveillance video. Further advances in computer vision, perhaps combining multiple 2D views, are necessary. It is interesting to note that in the UK it is almost impossible to walk outside for five minutes without being captured by some surveillance video system.
- Content extraction from noisy media, such as telephone conversations and home videos (as seen on YouTube).
- Correlating multimedia data with other data sources, especially text sources.
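For the last of these challenges, a basic form of correlation is temporal alignment: assigning each line of a time-stamped text source (e.g., subtitles) to the video shot whose time span contains it. The shot boundaries and subtitle data below are made-up example values, not output of any particular system.

```python
import bisect

# Hypothetical sketch of correlating text with video structure: map each
# (start_time, text) subtitle to the index of the shot containing it.
# Shot boundary times and subtitles are made-up example data.

def assign_shots(shot_starts, subtitles):
    """Return (shot_index, text) pairs; shot_starts must be sorted."""
    return [(bisect.bisect_right(shot_starts, t) - 1, text)
            for t, text in subtitles]

shots = [0.0, 12.5, 30.0]  # shot start times in seconds
subs = [(3.0, "The match begins."), (31.2, "A goal is scored!")]
aligned = assign_shots(shots, subs)
```

Once text is anchored to shots, entities and events extracted from the text can serve as semantic labels for the corresponding video segments.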