Detection-Fusion for Knowledge Graph Extraction from Videos

Taniya Das, Louis Mahon, Thomas Lukasiewicz

arXiv.org Artificial Intelligence 

Visual understanding has been a central question in AI since the inception of the field. However, it is not obvious how to quantify whether a machine understands what it sees. One simple way is classification, and indeed, much of the computer vision research of the last ten years has centered around ImageNet. Object classification performance is easy to measure, but it conveys only a coarse description of an image and misses further information about the properties and relations of the objects present. Another approach is to generate a natural language sentence describing the visual contents. This escapes the limitation of classification and can express all the complexity that natural language can express. However, using natural language comes with a number of disadvantages. The model not only has to learn to understand the contents of the video but also how to express this content in natural language, which is a significant additional requirement. Even in humans, understanding is quite a separate problem from articulation in language, as evidenced by patients with damage to Broca's area in the brain, who show normal understanding of visual and even linguistic information [2] but struggle to articulate this understanding in language.