Video recognition models have progressed significantly over the past few years, evolving from shallow classifiers trained on hand-crafted features to deep spatiotemporal networks. However, the labeled video data required to train such models has not kept up with the ever-increasing depth and sophistication of these networks. In this work we propose an alternative approach to learning video representations that requires no semantically labeled videos, and instead leverages the years of effort in collecting and labeling large and clean still-image datasets. We do so by using state-of-the-art models pre-trained on image datasets as "teachers" to train video models in a distillation framework. We demonstrate that our method learns truly spatiotemporal features, despite being trained using supervision only from still-image networks.
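As a rough illustration of the distillation idea, the sketch below builds a clip-level soft target from an image teacher's per-frame outputs and scores a video student against it. The abstract does not specify the loss or how frame predictions are aggregated, so the frame averaging, the cross-entropy objective, the function names, and all numbers here are assumptions made for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def clip_soft_target(frame_logits, temperature=1.0):
    """Average the image teacher's per-frame distributions into one clip-level target.

    Averaging over frames is an assumption, not something the abstract states.
    """
    per_frame = [softmax(f, temperature) for f in frame_logits]
    n = len(per_frame)
    return [sum(p[c] for p in per_frame) / n for c in range(len(per_frame[0]))]

def distillation_loss(student_logits, target):
    """Cross-entropy between the teacher's soft target and the student's prediction."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Toy example: 3 frames, 4 image classes (all numbers are made up).
teacher_frame_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [1.8, 0.7, 0.0, -0.5],
    [2.2, 0.3, 0.2, -1.2],
]
target = clip_soft_target(teacher_frame_logits)

good_student = [2.0, 0.5, 0.1, -1.0]   # agrees with the teacher
bad_student = [-1.0, 0.1, 0.5, 2.0]    # disagrees with the teacher
loss_good = distillation_loss(good_student, target)
loss_bad = distillation_loss(bad_student, target)
```

In this toy setting the student that agrees with the teacher receives the lower loss, and no video label is used at any point; only the teacher's outputs on individual frames supervise the video model.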
We present a new model, DRNET, that learns disentangled image representations from video. Our approach leverages the temporal coherence of video and a novel adversarial loss to learn a representation that factorizes each frame into a stationary part and a temporally varying component. The disentangled representation can be used for a range of tasks. For example, applying a standard LSTM to the time-varying components enables prediction of future frames. For this task, we demonstrate the ability to coherently generate up to several hundred steps into the future.
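The long-horizon prediction described above can be pictured as a rollout in which the stationary code is held fixed while only the time-varying code is advanced. The sketch below is a toy version under stated assumptions: `step` is a hypothetical tanh recurrence standing in for an LSTM cell, the vectors and weights are made up, and no encoder or decoder is implemented.

```python
import math

def step(state, pose, w_state=0.5, w_pose=0.9):
    """Toy recurrent update standing in for an LSTM cell (hypothetical weights)."""
    new_state = [math.tanh(w_state * s + w_pose * p) for s, p in zip(state, pose)]
    # Return the new hidden state and use it as the predicted next pose code.
    return new_state, new_state

def rollout(content, pose, n_steps):
    """Predict n_steps future (content, pose) pairs, holding content fixed."""
    state = [0.0] * len(pose)
    frames = []
    for _ in range(n_steps):
        state, pose = step(state, pose)
        # A decoder (not shown) would map each pair back to pixels.
        frames.append((content, pose))
    return frames

content = [0.3, -0.7, 1.1]  # stationary code from the last observed frame
pose = [0.2, -0.1]          # time-varying code from the last observed frame
future = rollout(content, pose, n_steps=200)
```

Because the recurrence only ever touches the low-dimensional time-varying code, the stationary part of every generated frame is identical by construction, which is one plausible reason such a factorization helps when generating many steps ahead.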
Current computing systems are just beginning to enable the computational manipulation of temporal media like video and audio. Because of the opacity of these media, they must be represented in order to be manipulable according to their contents. Knowledge representation techniques have been implicitly designed for representing the physical world and its textual representations. Temporal media pose unique problems and opportunities for knowledge representation which challenge many of its assumptions about the structure and function of what is represented. The semantics and syntax of temporal media require representational designs which employ fundamentally different conceptions of space, time, identity, and action. In particular, the effect of the syntax of video sequences on the semantics of video shots demands a representational design which can clearly articulate the differences between the context-dependent and context-independent semantics of video data. This paper outlines the theoretical foundations for designing representations of video, discusses Media Streams, an implemented system for video representation and retrieval, and critiques related efforts in this area.
We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over state-of-the-art results on standard benchmarks for acoustic scene/object classification.
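One way to write a student-teacher objective of this kind is as a divergence between the visual teacher's class distribution and the sound student's prediction for the same video. The abstract does not state the loss, so the KL form below and every symbol in it are assumptions: $x_i$ is the sound track of video $i$, $y_i$ its frames, $g$ a fixed pre-trained visual network, and $f_\theta$ the sound network being trained.

```latex
\[
\mathcal{L}(\theta) = \sum_{i=1}^{N} D_{\mathrm{KL}}\big( g(y_i) \,\|\, f_\theta(x_i) \big),
\qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{k} p_k \log \frac{p_k}{q_k}
\]
```

Under this formulation only $\theta$ is updated; the teacher $g$ stays frozen and the $N$ unlabeled videos serve purely as paired inputs linking the two modalities.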