The research presented in this paper is primarily concerned with the use of complementary textual resources in video and image analysis to support a higher level of automatic (semantic) annotation and indexing of images and videos. Whereas past projects (such as the MUMIS project, described in more detail below) used Information Extraction as the main means of extracting relevant entities, relations, and events from text for indexing images and videos in a specific domain, today we can build on Semantic Web technologies and resources to detect instances of semantic classes and relations in textual documents, and use these to support the semantic annotation and indexing of audiovisual content.
A variety of technologies have been developed to index and manage information contained in unstructured documents and information sources. Although many of these information management tools were originally developed for textual data, they have been applied successfully to imperfectly indexed multimedia data.
In this article, we develop a framework for building domain ontologies and a semantic index based on two technologies: terminology extraction with LEXTER (EDF R&D) and discourse and semantic annotation with EXCOM. We have selected two specific points of view for this study: the notions of causality and part-whole. In the first part of this paper, we explain the contributions of a terminology and of discursive and semantic relations to domain ontology building. In the second part, we propose a semantics-based index for information retrieval.
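The shape of such a semantics-based index can be sketched as follows. This is a minimal illustration only: the class name, the relation labels, and the example annotations are placeholders invented for this sketch, not the output of LEXTER or EXCOM.

```python
from collections import defaultdict

# Hypothetical semantics-based index: documents are retrievable not only by
# terms but by the discursive relation (here, causality or part-whole) that
# links two terms, as produced by a semantic annotation step.
class SemanticIndex:
    def __init__(self):
        # (relation, source term, target term) -> set of document ids
        self.index = defaultdict(set)

    def add(self, doc_id, relation, source_term, target_term):
        self.index[(relation, source_term, target_term)].add(doc_id)

    def query(self, relation, source_term=None, target_term=None):
        # Retrieve documents where the relation holds; term filters optional.
        return sorted({doc
                       for (rel, s, t), docs in self.index.items()
                       if rel == relation
                       and (source_term is None or s == source_term)
                       and (target_term is None or t == target_term)
                       for doc in docs})

idx = SemanticIndex()
idx.add("doc1", "causality", "corrosion", "pipe failure")
idx.add("doc2", "part-whole", "rotor", "turbine")
idx.add("doc3", "causality", "corrosion", "leakage")

print(idx.query("causality", source_term="corrosion"))  # -> ['doc1', 'doc3']
```

A query such as `query("causality", source_term="corrosion")` retrieves every document asserting some effect of corrosion, which is the kind of relation-oriented retrieval a purely term-based index cannot express.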
Human language content in video typically appears either as spoken audio accompanying the visual content, or as text overlaid on the video or contained within the video scene itself. The bulk of research and engineering in language extraction from video has so far focused on spoken language content. More recently, researchers have also developed technologies capable of detecting and recognizing text content in video. Anecdotal evidence indicates that in rich multimedia sources such as Broadcast News, spoken and textual content provide complementary information. Here we present the results of a recent BBN study in which we compared named entities, a critically important type of language content, between aligned speech and videotext tracks. These new results show that videotext content provides significant additional information that does not appear in the speech stream.
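The core of such a comparison can be sketched in a few lines: given named entities extracted from the two aligned tracks, find the entities that the videotext contributes beyond the speech stream. The function name and the example entity lists below are illustrative assumptions, not data from the BBN study.

```python
def novel_entities(speech_entities, videotext_entities):
    """Return entities present in the videotext track but absent from speech.

    Matching is case-insensitive; a real study would also need to handle
    spelling variants and partial matches between the two tracks.
    """
    speech = {e.lower() for e in speech_entities}
    return sorted(e for e in set(videotext_entities) if e.lower() not in speech)

# Hypothetical aligned tracks from one news segment.
speech_track = ["Washington", "Pentagon", "Donald Rumsfeld"]
videotext_track = ["Washington", "CNN", "Baghdad"]

print(novel_entities(speech_track, videotext_track))  # -> ['Baghdad', 'CNN']
```

Aggregating the size of this novel set over many segments is one simple way to quantify how much additional information the videotext track carries.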