While today's deep learning systems can natively analyze video, the large file sizes of high-resolution movies present unique challenges in storage and computational requirements. Sampling them into sequences of still images not only allows real-time processing of unlimited-length videos but also opens the door to creative new applications like "video ngrams." The most straightforward way to sample a video into a sequence of still images is a fixed-rate, time-based mechanism such as one frame per second. This kind of sampling is supported natively by most tools, including ffmpeg, and provides a simple and robust workflow. At the same time, it is highly inefficient, especially for videos with a lot of repetition.
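As a minimal sketch of the fixed-rate workflow described above, the snippet below constructs the ffmpeg command for one-frame-per-second sampling using ffmpeg's `fps` video filter. The function name, file names, and output pattern are illustrative; the command itself uses standard ffmpeg flags.

```python
def build_sample_cmd(video_path: str, out_pattern: str, fps: float = 1.0) -> list[str]:
    """Construct an ffmpeg command for fixed-rate frame sampling.

    fps=1.0 emits one still per second of video; out_pattern should
    contain a printf-style counter such as frame_%06d.jpg.
    """
    return [
        "ffmpeg",
        "-i", video_path,       # input video
        "-vf", f"fps={fps}",    # fixed-rate sampling filter
        out_pattern,            # e.g. "frames/frame_%06d.jpg"
    ]

cmd = build_sample_cmd("news.mp4", "frames/frame_%06d.jpg", fps=1.0)
# Execute with: subprocess.run(cmd, check=True)
```

Because the sampling rate is fixed, this command produces one image per second regardless of how much the footage actually changes, which is precisely the inefficiency noted above for repetitive video.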
Deep learning has revolutionized machine understanding of imagery. Yet today's image recognition models are still limited by the availability of large annotated training datasets upon which to build their libraries of recognized objects and activities. To address this, Google's Vision AI API expands its native catalog of around 10,000 visually recognized objects and activities with the ability to perform the equivalent of a reverse Google Images search across the open Web, tallying the top topics used to caption a given image everywhere it has previously appeared. This lends unusually rich context and understanding, even yielding unique labels for breaking news events. What might this process yield for a week of television news? Google's Vision AI API represents a unique hybrid: traditional deep learning-based image labeling built on a library of previously trained models, combined with the ability to leverage the open Web to annotate images with the most common topics visually similar images are captioned with.
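The tallying step described above can be sketched as a simple aggregation over per-image Web-detection results. The dict shape below (a list of entities with `description` and `score` fields) is an assumption modeled on the Vision API's REST `webDetection` response; the mock data is purely illustrative.

```python
from collections import Counter

def tally_web_entities(responses: list[dict], top_n: int = 5) -> list[tuple[str, int]]:
    """Count the most frequent Web-detection entities across many images.

    `responses` holds one dict per image, assumed to follow the shape
    {"webEntities": [{"description": ..., "score": ...}, ...]}.
    """
    counts = Counter()
    for resp in responses:
        for entity in resp.get("webEntities", []):
            desc = entity.get("description")
            if desc:                  # some entities carry only a score
                counts[desc] += 1
    return counts.most_common(top_n)

# Two mock per-image responses standing in for real API output:
mock = [
    {"webEntities": [{"description": "Press conference", "score": 0.9},
                     {"description": "Podium", "score": 0.5}]},
    {"webEntities": [{"description": "Press conference", "score": 0.8}]},
]
top = tally_web_entities(mock)
```

Run at scale over a week of broadcast frames, this kind of tally surfaces the dominant Web-derived topics rather than only the model's native object labels.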
Television news coverage brings to mind images of newsreaders in studios, reporters in the field, previously recorded footage and rapid-fire barrages of vivid advertising imagery. This raises the question of just how long a typical "shot" lasts and whether there are substantial differences between television news stations. Using the "Shot Change" detection feature of Google's Video AI platform to analyze a week of television news, what new insights could we learn about the speed at which television news narratives move? Google's Video AI API brings the company's image analysis algorithms to the world of video. While in the past videos had to be split into frames and analyzed as still images, the Video AI API analyzes videos natively, supporting time-based analyses like detecting shot changes.
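Measuring how long a typical shot lasts reduces to averaging the durations of detected shots. The sketch below assumes each detected shot is reported as a (start, end) pair of time offsets in seconds, modeled loosely on the shot annotations the Video AI API returns; the sample offsets are made up.

```python
def average_shot_duration(shots: list[tuple[float, float]]) -> float:
    """Mean shot length in seconds, given (start, end) offsets per shot.

    The (start, end) representation is an assumed simplification of the
    per-shot time offsets reported by a shot-change detector.
    """
    if not shots:
        raise ValueError("no shots detected")
    return sum(end - start for start, end in shots) / len(shots)

# Three mock shots spanning 0-2.5s, 2.5-4.0s, and 4.0-10.0s:
avg = average_shot_duration([(0.0, 2.5), (2.5, 4.0), (4.0, 10.0)])
```

Comparing this average across stations is one direct way to quantify how quickly each outlet's visual narrative moves.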
What would it look like to ask a deep learning AI system to watch every political television advertisement of the 2016 presidential campaign season for two months and describe what it sees? That was the question I asked last February when I collaborated with the Internet Archive to take all 267 political ads they had identified (which had aired a collective 72,807 times as monitored by the Archive) and run them frame-by-frame through Google's Cloud Vision API, producing what is likely the first large-scale application of production deep learning algorithms to describe the visual narratives of political advertising on television. Now, what if we took this same approach and, instead of examining television, looked at a quarter billion news photographs compiled from online news outlets in nearly every country of the world over the course of 2016? What would AI see in that vast archive of the visual narratives of the world's media? Google's Cloud Vision API is a commercial cloud service that accepts as input any arbitrary photograph and uses deep learning algorithms to catalog a wealth of data about each image, including a list of objects and activities it depicts, recognizable logos, OCR text recognition in almost 80 languages, levels of violence, an estimate of visual sentiment and even the precise location on earth the image appears to depict.
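The frame-by-frame approach above ultimately reduces each ad to an aggregate of the labels its frames received. As a minimal sketch, the function below collapses per-frame label lists into a single histogram; the frame data is mock input standing in for real image-labeling output.

```python
from collections import Counter

def label_histogram(frame_labels: list[list[str]]) -> Counter:
    """Aggregate per-frame label lists into one histogram summarizing
    an ad's visual narrative.

    `frame_labels` holds one list of label descriptions per sampled
    frame, as produced by an image-labeling service.
    """
    counts = Counter()
    for labels in frame_labels:
        counts.update(labels)
    return counts

# Mock labels for three frames of a hypothetical campaign ad:
frames = [["flag", "crowd"], ["flag", "podium"], ["crowd", "flag"]]
hist = label_histogram(frames)
```

The same aggregation scales from a single 30-second ad to an archive of hundreds of millions of news photographs: only the size of the input changes.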
Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for training visual representations in deep neural networks. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which extends powerful recent unsupervised loss functions for learning deep nonlinear embeddings to multi-stream temporal processing architectures on large-scale video datasets. We show that VIE-trained networks substantially advance the state of the art in unsupervised learning from video datastreams, both for action recognition in the Kinetics dataset, and object recognition in the ImageNet dataset. We show that a hybrid model with both static and dynamic processing pathways is optimal for both transfer tasks, and provide analyses indicating how the pathways differ. Taken in context, our results suggest that deep neural embeddings are a promising approach to unsupervised visual learning across a wide variety of domains.