The research presented in this paper is primarily concerned with the use of complementary textual resources in video and image analysis to support a higher level of automatic (semantic) annotation and indexing of images and videos. While past projects (such as the MUMIS project, described in more detail below) used Information Extraction as the main means for extracting relevant entities, relations, and events from text, which could then be used to index images and videos in a specific domain, today we can build on Semantic Web technologies and resources to detect instances of semantic classes and relations in textual documents, and use those for supporting the semantic annotation and indexing of audiovisual content.
The broad challenge, in my view, is to exploit multilingual, multimedia information from both web and TV video to enable a broader understanding of the different ideological, social, and cultural perspectives found in different sources, for a wide variety of applications. This will involve the judicious analysis of text and video features using a variety of machine learning and language analysis methods, as well as an understanding of the video editing structure and of the context in which the media appears. Other challenges involve coping with the flood of data through mechanisms for intelligent, context-driven summarization, as our brains remain limited in the amount of information they can process. Yet other challenges concern the mobile use of multimedia data, given limited bandwidth, small displays, and limited multimedia input mechanisms, as well as the social network that provides the context of the media; the latter challenge, however, is not addressed in the following paragraphs. At this point, the infrastructure for collecting massive amounts of multimedia and sensor data exists, and processing hardware is becoming relatively affordable.
Starting from the observation that certain communities have incentive mechanisms in place to create large amounts of unstructured content, we propose in this paper an original model that we expect to yield the large number of annotations required to semantically enrich Web content at scale. The novelty of our model lies in the combination of two key ingredients: the effort that online communities invest in creating content, and the capability of machines to detect regular patterns in user annotations and suggest new ones. Provided that the creation of semantic content is made easy enough and incentives are in place, we can assume that these communities will be willing to provide annotations. However, since human resources are clearly limited, we integrate algorithmic support into our model to bootstrap from existing annotations and learn patterns for suggesting new ones. Because automatically extracted information needs to be validated, our model presents the extracted knowledge to the user in the form of questions, thus allowing the information to be validated. In this paper, we describe the requirements for our model and its concrete implementation based on Semantic MediaWiki and an information extraction system, and we discuss lessons learned from practical experience with real users. These experiences allow us to conclude that our model is a promising approach to leveraging semantic annotation.
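The suggest-and-validate loop described above can be sketched in a few lines. This is a minimal illustration, not the actual Semantic MediaWiki implementation: the trigger-phrase heuristic (two words preceding an annotated value), the property names, and the `confirm` callback are all assumptions made for the example.

```python
import re

def learn_patterns(annotated_pages):
    """For each existing (property, value) annotation, store the phrase
    immediately preceding the value as a trigger for that property."""
    patterns = {}
    for text, annotations in annotated_pages:
        for prop, value in annotations:
            m = re.search(r"(\w+ \w+) " + re.escape(value), text)
            if m:
                patterns[m.group(1)] = prop
    return patterns

def suggest_annotations(text, patterns, confirm):
    """Whenever a learned trigger phrase appears, propose the following
    word as a new annotation; commit only user-confirmed suggestions."""
    accepted = []
    for trigger, prop in patterns.items():
        for m in re.finditer(re.escape(trigger) + r" (\w+)", text):
            question = f"Is '{m.group(1)}' the {prop} of this page?"
            if confirm(question):
                accepted.append((prop, m.group(1)))
    return accepted

# Learn from an already-annotated page, then suggest for a new one.
pages = [("Berlin is the capital of Germany.", [("capital_of", "Germany")])]
patterns = learn_patterns(pages)
suggested = suggest_annotations("Vienna is the capital of Austria.",
                                patterns, confirm=lambda q: True)
```

Presenting each suggestion as a yes/no question keeps the human in the loop: erroneous machine extractions are filtered out at confirmation time rather than silently written into the wiki.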
This paper addresses the problem of performing accurate semantic annotations in a large corpus. The task of creating a sense-tagged corpus differs from the word sense disambiguation problem in that the semantic annotations have to be highly accurate, even if the price to be paid is lower coverage. While the state of the art in word sense disambiguation does not exceed 70% precision, we want to find the means to perform semantic annotations with an accuracy close to 100%. We address this problem in the process of disambiguating the definitions in the WordNet dictionary. We propose in this paper a method that is able to tag words with high precision, using pattern extraction followed by pattern matching. The algorithm exploits the idiosyncratic nature of the corpus to be tagged, and achieves a precision of 99% with a coverage of 6%, measured on a WordNet subset, and an estimated coverage of more than 12.5% for the entire WordNet.
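The precision-over-coverage idea behind pattern extraction followed by pattern matching can be sketched as follows. This is a toy illustration under strong simplifying assumptions, not the paper's actual algorithm: the corpus, the one-word-context patterns, and the sense labels are invented for the example, and a word is tagged only when its context pattern was always associated with a single sense in the seed data.

```python
from collections import defaultdict

def extract_patterns(tagged_glosses):
    """Collect (left, word, right) context patterns from sense-tagged
    glosses, keeping only patterns that always map to a single sense."""
    contexts = defaultdict(set)
    for tokens in tagged_glosses:
        for i, (word, sense) in enumerate(tokens):
            if sense is None:
                continue
            left = tokens[i - 1][0] if i > 0 else "<s>"
            right = tokens[i + 1][0] if i + 1 < len(tokens) else "</s>"
            contexts[(left, word, right)].add(sense)
    return {pat: senses.pop() for pat, senses in contexts.items()
            if len(senses) == 1}

def match_patterns(untagged_gloss, patterns):
    """Tag a word only where a stored pattern matches exactly; leave the
    rest untagged, trading coverage for precision."""
    tagged = []
    for i, word in enumerate(untagged_gloss):
        left = untagged_gloss[i - 1] if i > 0 else "<s>"
        right = untagged_gloss[i + 1] if i + 1 < len(untagged_gloss) else "</s>"
        tagged.append((word, patterns.get((left, word, right))))
    return tagged

# Toy seed data: glosses with some words already sense-tagged.
seed = [
    [("a", None), ("financial", "financial%1"), ("institution", "institution%1")],
    [("the", None), ("financial", "financial%1"), ("sector", None)],
]
patterns = extract_patterns(seed)
result = match_patterns(["a", "financial", "institution"], patterns)
```

Because only exact, unambiguous context matches are tagged, most words are left untouched; this is precisely the regime the abstract describes, where precision stays near 100% while coverage remains in the single digits.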
In the following, I focus on one particular challenge: the integration of affective computing and multimedia information extraction. What are the critical technical challenges in multimedia information extraction (MMIE)? Work by Picard and others has created considerable awareness of the role of affect in human-computer interaction. As key ingredients of affective computing, Picard identifies recognizing, expressing, modelling, communicating, and responding to emotional information (Picard 2003). In the context of information extraction, methods of affective computing can be applied to enhance classical information extraction tasks with emotion and sentiment detection.
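One simple way to enhance an extraction task with sentiment, as suggested above, is to attach an affect label to each extracted mention. The sketch below is a deliberately crude illustration: the sentiment lexicon, the entity list, and the window size are all assumptions made for the example, standing in for the far richer affect models discussed in the affective computing literature.

```python
# Toy sentiment lexicon; a real system would use a large affect resource.
SENTIMENT = {"great": 1, "impressive": 1, "poor": -1, "disappointing": -1}

def annotate_sentiment(tokens, entities, window=3):
    """Attach a crude sentiment label to each extracted entity mention
    by summing lexicon scores of the words within a fixed window."""
    results = []
    for i, tok in enumerate(tokens):
        if tok in entities:
            nearby = tokens[max(0, i - window): i + window + 1]
            score = sum(SENTIMENT.get(w.lower(), 0) for w in nearby)
            label = ("positive" if score > 0
                     else "negative" if score < 0 else "neutral")
            results.append((tok, label))
    return results

# Entity mentions now carry an affect label alongside the extraction.
mentions = annotate_sentiment(
    "The keynote by Picard was impressive".split(), {"Picard"})
```

Even this bag-of-words scheme shows the shape of the integration: the classical extraction output (the entity mention) is enriched with an affective dimension rather than replaced by it.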