[Figure: Plate from Muybridge's Animal Locomotion series, published in 1887.]

Deep learning has become the dominant lens through which machines understand video. Yet video files consume enormous amounts of storage and are computationally expensive to analyze with deep learning. Certain use cases can benefit from converting videos into sequences of still images for analysis, enabling full data parallelism and vast reductions in storage and computation. Representing video as still imagery also presents unique opportunities for non-consumptive analysis, similar to the use of ngrams for text.
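The conversion described above amounts to a frame-sampling step: decide which frames of the video to keep as stills, then decode and analyze each one independently. The sketch below is a minimal illustration of the index arithmetic, assuming a hypothetical clip with a known frame rate and duration (no decoding library is used; the function name and parameters are invented for illustration):

```python
def sample_frame_indices(fps: float, duration_s: float, sample_rate_hz: float = 1.0):
    """Return the frame indices to extract when converting a video to stills.

    fps: frames per second of the source video.
    duration_s: length of the video in seconds.
    sample_rate_hz: how many stills to keep per second (1.0 = one frame per second).
    """
    total_frames = int(fps * duration_s)
    step = max(1, round(fps / sample_rate_hz))
    return list(range(0, total_frames, step))

# One still per second of a 10-second, 30fps clip yields 10 frame indices.
indices = sample_frame_indices(fps=30.0, duration_s=10.0)
```

Each resulting index could then be handed to a decoder (e.g. ffmpeg or OpenCV) and analyzed in isolation, which is where the full data parallelism comes from: no still depends on any other.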
This past May I worked with the Internet Archive's Television News Archive to apply Google's suite of cloud AI APIs to a week of television news coverage, examining how AI "sees" television and what insights we might gain into the world of non-consumptive, deep learning-powered video understanding. Using Google's video, image, speech and natural language APIs as lenses, more than 600GB of machine annotations trace how today's deep learning algorithms understand video. What lessons can we learn about the state of AI today, and how can it be applied in creative ways to catalog and explore the vast world of video?

Working with the Internet Archive's Television News Archive, a week of television news was selected covering CNN, MSNBC and Fox News, along with the morning and evening broadcasts of San Francisco affiliates KGO (ABC), KPIX (CBS), KNTV (NBC) and KQED (PBS), from April 15 to April 22, 2019, totaling 812 hours of television news. This week was selected because it contained two major stories, one national (the release of the Mueller report on April 18th) and one international (the Notre Dame fire on April 15th).
Deep learning has revolutionized the machine understanding of imagery. Yet today's image recognition models are still limited by the availability of large annotated training datasets upon which to build their libraries of recognized objects and activities. To address this, Google's Vision AI API represents a unique hybrid: it combines traditional deep learning-based image labeling, drawing on a native catalog of around 10,000 visually recognized objects and activities, with the ability to perform the equivalent of a reverse Google Images search across the open Web, tallying the top topics used to caption the given image everywhere it has previously appeared. This lends unprecedentedly rich context and understanding, even yielding unique labels for breaking news events. What might this process yield for a week of television news?
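At its core, the web-based labeling step is a tallying problem: given the captions attached to visually similar images found across the Web, count the most common topics. A minimal sketch of that tallying logic (the caption data here is invented for illustration; the real API returns structured web entities rather than raw caption strings):

```python
from collections import Counter

def top_caption_topics(captions, n=3):
    """Tally the most common words across captions of visually similar images."""
    stopwords = {"the", "a", "of", "in", "at", "on"}
    words = [w for caption in captions
             for w in caption.lower().split()
             if w not in stopwords]
    return Counter(words).most_common(n)

captions = [
    "notre dame cathedral fire",
    "fire at notre dame in paris",
    "notre dame burning",
]
topics = top_caption_topics(captions)  # "notre dame" and "fire" dominate the tally
```

This is how a breaking news event can receive a useful label before any model has been trained on it: the label emerges from how the Web itself has already captioned visually similar imagery.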
Because of the rich dynamical structure of videos and their ubiquity in everyday life, it is a natural idea that video data could serve as a powerful unsupervised learning signal for training visual representations in deep neural networks. However, instantiating this idea, especially at large scale, has remained a significant artificial intelligence challenge. Here we present the Video Instance Embedding (VIE) framework, which extends powerful recent unsupervised loss functions for learning deep nonlinear embeddings to multi-stream temporal processing architectures on large-scale video datasets. We show that VIE-trained networks substantially advance the state of the art in unsupervised learning from video datastreams, both for action recognition in the Kinetics dataset, and object recognition in the ImageNet dataset. We show that a hybrid model with both static and dynamic processing pathways is optimal for both transfer tasks, and provide analyses indicating how the pathways differ. Taken in context, our results suggest that deep neural embeddings are a promising approach to unsupervised visual learning across a wide variety of domains.
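The unsupervised embedding losses the abstract builds on can be illustrated with a simplified instance-discrimination objective: each clip's embedding is pushed toward its own stored "memory" vector and away from all others via a non-parametric softmax. The NumPy sketch below is a generic illustration of that loss family, not the actual VIE implementation (the memory-bank layout and temperature value are assumptions for the example):

```python
import numpy as np

def instance_discrimination_loss(embeddings, memory_bank, temperature=0.07):
    """Simplified non-parametric softmax loss over instance embeddings.

    embeddings:  (N, D) L2-normalized embeddings for a batch of N clips.
    memory_bank: (M, D) L2-normalized stored embeddings, one per instance;
                 here we assume clip i's positive is memory_bank[i].
    Returns the mean cross-entropy of classifying each clip as itself.
    """
    logits = embeddings @ memory_bank.T / temperature   # (N, M) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = embeddings.shape[0]
    return float(-log_probs[np.arange(n), np.arange(n)].mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
loss = instance_discrimination_loss(x, x)  # each clip matches its own memory slot
```

Because each embedding is most similar to itself, the loss is small but nonzero; during training, gradients of this objective shape the embedding space without any labels.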
Most classical denoising methods restore clean results by selecting and averaging pixels in the noisy input. Instead of relying on hand-crafted selection and averaging strategies, we propose to explicitly learn this process with deep neural networks. Specifically, we propose deformable 2D kernels for image denoising in which both the sampling locations and the kernel weights are learned. The proposed kernel naturally adapts to image structures and can effectively reduce oversmoothing artifacts. Furthermore, we develop 3D deformable kernels for video denoising to more efficiently sample pixels across the spatial-temporal space. Our method is able to handle the misalignment caused by large motion in dynamic scenes. To better train our video denoising model, we introduce a trilinear sampler and a new regularization term. We demonstrate that the proposed method performs favorably against state-of-the-art image and video denoising approaches on both synthetic and real-world data.
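The mechanics of a deformable 2D kernel can be sketched as follows: for each output pixel, a small set of 2D offsets selects sampling points, values are fetched with bilinear interpolation (since offsets are continuous), and a weight vector averages the samples. In the toy NumPy version below the offsets and weights are fixed by hand purely to show the mechanics; in the method described above, both are predicted by a network:

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly interpolate img (H, W) at continuous coordinates (y, x)."""
    h, w = img.shape
    y, x = np.clip(y, 0, h - 1), np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (img[y0, x0] * (1 - dy) * (1 - dx) + img[y0, x1] * (1 - dy) * dx
            + img[y1, x0] * dy * (1 - dx) + img[y1, x1] * dy * dx)

def deformable_denoise_pixel(img, y, x, offsets, weights):
    """Weighted average of samples taken at (learned) offsets around (y, x).

    offsets: (K, 2) array of (dy, dx) sampling offsets.
    weights: (K,) kernel weights; normalized so a constant image stays constant.
    """
    weights = weights / weights.sum()
    samples = [bilinear_sample(img, y + dy, x + dx) for dy, dx in offsets]
    return float(np.dot(weights, samples))

img = np.ones((5, 5))                                     # constant "clean" image
offsets = np.array([[0.0, 0.0], [-0.5, 0.5], [0.5, -0.5]])
weights = np.array([0.5, 0.25, 0.25])
out = deformable_denoise_pixel(img, 2, 2, offsets, weights)  # stays 1.0
```

The 3D video variant extends this idea with a third (temporal) offset dimension and trilinear rather than bilinear sampling, which is what lets the kernel follow content across frames despite large motion.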