The research presented in this paper is primarily concerned with the use of complementary textual resources in video and image analysis to support a higher level of automatic (semantic) annotation and indexing of images and videos. While past projects (such as the MUMIS project, described in more detail below) used Information Extraction as the main means of extracting relevant entities, relations, and events from text for indexing images and videos in a specific domain, today we can build on Semantic Web technologies and resources to detect instances of semantic classes and relations in textual documents, and to use those instances to support the semantic annotation and indexing of audiovisual content.
A variety of technologies have been developed to index and manage information contained within unstructured documents and information sources. Although often originally developed for textual data, many of these information management tools have been applied successfully to imperfectly indexed multimedia data.
Human language content in video is typically manifested either as spoken audio that accompanies the visual content, or as text that is overlaid on the video or contained within the video scene itself. The bulk of research and engineering in language extraction from video thus far has focused on spoken language content. More recently, researchers have also developed technologies capable of detecting and recognizing text content in video. Anecdotal evidence indicates that in the case of rich multimedia sources such as Broadcast News, spoken and textual content provide complementary information. Here we present the results of a recent BBN study in which we compared named entities, a critically important type of language content, between aligned speech and videotext tracks. These new results show that videotext content provides significant additional information that does not appear in the speech stream.
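The comparison described above can be illustrated with a simple set-based overlap measure between the named entities recognized in each track. This is only a sketch of the idea, not the BBN study's actual methodology: the entity lists and the function name are hypothetical, and a real comparison would run an NER system over the speech transcript and the recognized videotext.

```python
def videotext_novelty(speech_entities, videotext_entities):
    """Fraction of videotext named entities that never appear in the
    aligned speech track -- a rough measure of complementary content.
    (Illustrative only; entity matching in practice needs normalization
    beyond simple lowercasing.)"""
    speech = {e.lower() for e in speech_entities}
    video = {e.lower() for e in videotext_entities}
    novel = video - speech
    return len(novel) / len(video) if video else 0.0
```

For example, if the speech track yields ["John Smith", "CNN"] and the videotext track yields ["John Smith", "Baghdad", "Reuters"], two of the three videotext entities are novel, giving a novelty of 2/3.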
In this paper we examine the effectiveness of using a filtered stream of tweets from Twitter to automatically identify events of interest within the video of live sports transmissions. We show that the volume of tweets generated at any given moment of a game alone provides a highly accurate means of event detection, as well as an automatic method for tagging events with representative words from the tweet stream. We compare this method with an alternative approach based on complex audio-visual content analysis of the video, showing that it achieves near-equivalent accuracy for major event detection at a fraction of the computational cost. Using community tweets and discussion also provides a sense of what the audience themselves found to be the talking points of a video.
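The volume-based detection idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes tweets have already been filtered to the match, and the window size, spike threshold, and tag count are arbitrary example parameters.

```python
from collections import Counter

def detect_events(tweet_times, tweet_texts, window=60, spike_factor=3.0):
    """Flag time windows whose tweet volume spikes well above the mean
    volume, and tag each flagged event with its most frequent words.
    tweet_times are seconds since kickoff; thresholds are illustrative."""
    # Bucket tweets into fixed-length time windows.
    buckets = {}
    for t, text in zip(tweet_times, tweet_texts):
        buckets.setdefault(t // window, []).append(text)

    volumes = {b: len(texts) for b, texts in buckets.items()}
    mean_volume = sum(volumes.values()) / len(volumes)

    events = []
    for b in sorted(buckets):
        if volumes[b] >= spike_factor * mean_volume:
            # Tag the event with representative words from its window.
            words = Counter(w.lower() for txt in buckets[b] for w in txt.split())
            events.append((b * window, words.most_common(3)))
    return events
```

A window whose volume exceeds three times the mean is reported together with its top words, which in a real match would surface terms like "goal" or a player's name at the moment of the event.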