While today's deep learning systems can natively analyze video, the large file sizes of high-resolution movies pose unique challenges in terms of storage space and computational requirements. Sampling them into sequences of still images not only allows real-time processing of unlimited-length videos but also opens the door to creative new applications like "video ngrams." The most straightforward way to sample a video into a sequence of still images is a fixed-rate, time-based mechanism such as one frame per second. This kind of sampling is supported natively by most tools, including ffmpeg, and provides a simple, robust workflow. At the same time, it is highly inefficient, especially for videos with a lot of repetition.
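To make the fixed-rate approach concrete, here is a minimal sketch that computes the timestamps a one-frame-per-second sample would extract and builds the corresponding ffmpeg invocation using its `fps` filter. The file names and output pattern are placeholders, not paths from the project described here.

```python
# Sketch of fixed-rate, time-based sampling: which timestamps get a frame,
# and the ffmpeg command that would perform the extraction.

def fixed_rate_timestamps(duration_s: float, fps: float = 1.0) -> list[float]:
    """Timestamps (in seconds) sampled at a fixed rate from a video."""
    step = 1.0 / fps
    n = int(duration_s * fps)
    return [round(i * step, 3) for i in range(n)]

def ffmpeg_sample_cmd(src: str, fps: float = 1.0) -> list[str]:
    # ffmpeg's "fps" video filter performs exactly this kind of sampling,
    # writing one still image per sampled frame.
    return ["ffmpeg", "-i", src, "-vf", f"fps={fps}", "frame_%06d.jpg"]

print(fixed_rate_timestamps(5, 1.0))   # [0.0, 1.0, 2.0, 3.0, 4.0]
print(" ".join(ffmpeg_sample_cmd("news.mp4", 1.0)))
```

Note the inefficiency the text describes: a static studio shot held for a minute still yields sixty nearly identical frames at one frame per second.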
This past May I worked with the Internet Archive's Television News Archive to apply Google's suite of cloud AI APIs to a week of television news coverage, examining how AI "sees" television and what insights we might gain into the world of non-consumptive, deep learning-powered video understanding. Using Google's video, image, speech and natural language APIs as lenses, the project produced more than 600GB of machine annotations tracing how today's deep learning algorithms understand video. What lessons can we learn about the state of AI today and how it can be applied in creative ways to catalog and explore the vast world of video? The selected week covered CNN, MSNBC and Fox News, along with the morning and evening broadcasts of San Francisco affiliates KGO (ABC), KPIX (CBS), KNTV (NBC) and KQED (PBS), from April 15 to April 22, 2019, totaling 812 hours of television news. This week was chosen because it contained two major stories, one national (the release of the Mueller report on April 18th) and one international (the Notre Dame fire on April 15th).
Deep learning has revolutionized the machine understanding of imagery. Yet today's image recognition models are still limited by the availability of large annotated training datasets upon which to build their libraries of recognized objects and activities. To address this, Google's Vision AI API expands its native catalog of around 10,000 visually recognized objects and activities with the equivalent of a reverse Google Images search across the open Web: it tallies the top topics used to caption a given image everywhere it has previously appeared, lending unprecedentedly rich context and understanding and even yielding unique labels for breaking news events. What might this process yield for a week of television news? Google's Vision AI API thus represents a unique hybrid of traditional deep learning-based image labeling, built on a library of previously trained models, and the ability to leverage the open Web to annotate images with the most common topics that visually similar images are captioned with.
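The "tally up the top topics" step can be sketched as a simple aggregation over web-detection responses. The responses below are hand-made stand-ins shaped like the Vision API's `webDetection` JSON (its `webEntities` list of `description` fields), not real API output; a real pipeline would obtain them from the API itself.

```python
from collections import Counter

def top_web_labels(responses: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Tally the most common web-entity descriptions across many frames."""
    counts = Counter()
    for resp in responses:
        for entity in resp.get("webDetection", {}).get("webEntities", []):
            desc = entity.get("description")
            if desc:
                counts[desc] += 1
    return counts.most_common(n)

# Two sampled frames' worth of stand-in annotations.
frames = [
    {"webDetection": {"webEntities": [{"description": "Notre-Dame fire"},
                                      {"description": "Paris"}]}},
    {"webDetection": {"webEntities": [{"description": "Notre-Dame fire"}]}},
]
print(top_web_labels(frames))  # [('Notre-Dame fire', 2), ('Paris', 1)]
```

This is how a breaking-news label like "Notre-Dame fire," absent from any pretrained model, can surface: it is simply the caption the open Web most often attaches to visually similar images.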
Television news coverage brings to mind images of newsreaders in studios, reporters in the field, previously recorded footage and rapid-fire barrages of vivid advertising imagery. This raises the question of just how long a typical "shot" lasts and whether there are substantial differences among television news stations. Using the "Shot Change" detection feature of Google's Video AI platform to analyze a week of television news, what new insights could we learn about the speed at which television news narratives move? Google's Video AI API brings the company's image analysis algorithms to the world of video. While in the past videos had to be split into frames and analyzed as still images, the Video AI API analyzes videos natively, enabling time-based analyses such as detecting shot changes.
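To give a feel for what shot-change detection involves, here is a minimal sketch of the classic approach: flag a cut when consecutive frames' gray-level histograms differ sharply. This is an illustrative toy, not the Video AI API's actual method, which uses far more robust, learned models; the threshold and bin count here are arbitrary assumptions.

```python
import numpy as np

def detect_cuts(frames: np.ndarray, threshold: float = 0.5) -> list[int]:
    """frames: (n, h, w) uint8 grayscale. Returns frame indices where a cut begins."""
    cuts = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=32, range=(0, 256))
        hist = hist / hist.sum()  # normalize to a distribution
        if prev_hist is not None:
            # L1 distance between histograms lies in [0, 2]; a large jump
            # suggests the scene content changed abruptly.
            if np.abs(hist - prev_hist).sum() > threshold:
                cuts.append(i)
        prev_hist = hist
    return cuts

# Synthetic clip: 5 dark frames then 5 bright frames -> one cut at frame 5.
clip = np.concatenate([np.full((5, 8, 8), 20, np.uint8),
                       np.full((5, 8, 8), 220, np.uint8)])
print(detect_cuts(clip))  # [5]
```

With per-station cut indices like these, average shot length is just broadcast duration divided by the number of detected shots, which is exactly the kind of station-by-station comparison the paragraph above asks about.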
What would it look like to ask a deep learning AI system to watch every political television advertisement of the 2016 presidential campaign season for two months and describe what it sees? That was the question I asked last February when I collaborated with the Internet Archive to take all 267 political ads they had identified (which had aired a collective 72,807 times as monitored by the Archive) and run them frame-by-frame through Google's Cloud Vision API, producing what is likely the first large-scale application of production deep learning algorithms to describe the visual narratives of political advertising on television. Now, what if we took this same approach and, instead of examining television, looked at a quarter billion news photographs compiled from online news outlets in nearly every country of the world over the course of 2016? What would AI see in that vast archive of the visual narratives of the world's media? Google's Cloud Vision API is a commercial cloud service that accepts as input any arbitrary photograph and uses deep learning algorithms to catalog a wealth of data about each image, including a list of objects and activities it depicts, recognizable logos, OCR text recognition in almost 80 languages, levels of violence, an estimate of visual sentiment and even the precise location on Earth the image appears to depict.
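Running an ad frame-by-frame yields one list of label annotations per frame; describing the ad as a whole means aggregating them. The sketch below shows one simple way to do that: keep labels that recur across frames and report their mean confidence. The annotations are hand-made stand-ins shaped like the API's `labelAnnotations` output (`description` plus `score`), not real API responses, and the two-frame recurrence cutoff is an arbitrary assumption.

```python
from collections import defaultdict

def summarize_ad(frame_labels: list[list[dict]], min_frames: int = 2) -> dict:
    """Summarize per-frame label annotations into one ad-level description.

    Keeps labels seen in at least min_frames frames, with their mean score,
    sorted from most to least confident.
    """
    scores = defaultdict(list)
    for labels in frame_labels:
        for lab in labels:
            scores[lab["description"]].append(lab["score"])
    summary = {d: round(sum(s) / len(s), 2)
               for d, s in scores.items() if len(s) >= min_frames}
    return dict(sorted(summary.items(), key=lambda kv: -kv[1]))

# Three sampled frames of a hypothetical ad.
ad = [
    [{"description": "flag", "score": 0.95},
     {"description": "crowd", "score": 0.80}],
    [{"description": "flag", "score": 0.91}],
    [{"description": "podium", "score": 0.70}],
]
print(summarize_ad(ad))  # {'flag': 0.93}
```

One-off labels ("crowd," "podium") are dropped as likely noise, while the recurring "flag" survives; at the scale of 267 ads, or a quarter billion photographs, this kind of recurrence filtering is what turns per-frame noise into a readable visual narrative.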