Collaborating Authors

 Schroff, Florian


VideoPrism: A Foundational Visual Encoder for Video Understanding

arXiv.org Artificial Intelligence

We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 31 out of 33 video understanding benchmarks.


Learning to Generate Image Embeddings with User-level Differential Privacy

arXiv.org Artificial Intelligence

Representation learning, by training deep neural networks as feature extractors to generate compact embedding vectors from images, is a fundamental component in computer vision. Metric learning, a kind of representation learning using supervised data, has been widely applied to image recognition, clustering, and retrieval [Schroff et al., 2015; Weinberger and Saul, 2009; Weyand et al., 2020]. Machine learning models have the capacity to memorize training data [Carlini et al., 2019, 2021], leading to privacy risks when the models are deployed. Privacy risk can also be audited by membership inference attacks [Carlini et al., 2022; Shokri et al., 2017], i.e., detecting whether certain data was used to train a model and potentially exposing users' usage behaviors. Defending against such risks is a critical responsibility when training on privacy-sensitive data. Differential Privacy (DP) [Dwork et al., 2006] is an extensively used quantifiable measurement of privacy risk, now generally accepted as a standard notion of privacy in both industry and government [Apple Privacy Team, 2017; Ding et al., 2017; McMahan and Thakurta, 2022; US Census Bureau, 2021]. Applied to machine learning, DP requires a training procedure with explicit randomness, and guarantees that the distribution over output models is quantifiably similar given a certain scope of change to the training dataset. A DP guarantee with respect to the change of a single arbitrary training example is known as example-level DP, which provides plausible deniability (in the binary hypothesis testing sense of [Kairouz et al., 2015]) that any single example (e.g., image) occurred ...

The first two authors contributed equally.
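The "quantifiably similar" requirement on the distribution over output models is usually written as the standard (ε, δ)-DP inequality. The statement below is that textbook definition, given here as a reference sketch; the symbols M (the randomized training procedure), D and D' (neighboring datasets), and S (a set of output models) are notation chosen for this note and are not taken from the abstract.

```latex
% Standard (\varepsilon, \delta)-differential privacy.
% M: randomized training procedure; D, D': neighboring datasets;
% S: any measurable set of output models.
\[
  \Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in S\bigr] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all } S .
\]
% Example-level DP: D and D' differ in a single training example.
% User-level DP:    D and D' differ in all examples contributed by one user.
```

The only difference between the example-level and user-level guarantees is the neighboring relation: the "scope of change" mentioned in the abstract is one example in the former and one user's entire contribution in the latter.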


Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

Neural Information Processing Systems

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.


Modeling Uncertainty with Hedged Instance Embedding

arXiv.org Machine Learning

Instance embeddings are an efficient and versatile image representation that facilitates applications like recognition, verification, retrieval, and clustering. Many metric learning methods represent the input as a single point in the embedding space. Often the distance between points is used as a proxy for match confidence. However, this can fail to represent uncertainty which can arise when the input is ambiguous, e.g., due to occlusion or blurriness. We introduce the hedged instance embedding (HIB) in which embeddings are modeled as random variables and the model is trained under the variational information bottleneck principle (Alemi et al., 2016; Achille & Soatto, 2018). Empirical results on our new N-digit MNIST dataset show that our method leads to the desired behavior of "hedging its bets" across the embedding space upon encountering ambiguous inputs. This results in improved performance for image matching and classification tasks, more structure in the learned embedding space, and an ability to compute a per-exemplar uncertainty measure which is correlated with downstream performance. Embeddings are a versatile representation that supports various downstream tasks, including image retrieval (Babenko et al., 2014) and face recognition (Schroff et al., 2015). Instance embeddings are often treated deterministically, i.e., z = f(x) is a point in R^D. One drawback of this representation is the difficulty of modeling aleatoric uncertainty (Kendall & Gal, 2017), i.e., uncertainty induced by the input.
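To make the contrast between a point embedding and an embedding modeled as a random variable concrete, here is a minimal sketch of a Monte Carlo match score between two stochastic embeddings. It is an illustration under assumptions that the abstract does not state: each embedding is taken to be a diagonal Gaussian, the match probability for a sampled pair is taken to be a sigmoid of the negative Euclidean distance, and the function names and the constants `a` and `b` are hypothetical (in practice such constants would be fit to data).

```python
import math
import random


def sample_gaussian(mean, std, rng):
    """Draw one sample from a diagonal Gaussian embedding (assumed form)."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def match_probability(mean1, std1, mean2, std2,
                      a=1.0, b=0.0, n_samples=1000, seed=0):
    """Monte Carlo estimate of the probability that two inputs match.

    Each input is represented by a distribution (mean, std) rather than a
    single point. For every sampled pair (z1, z2) the score is
    sigmoid(-a * ||z1 - z2|| + b); a and b are hypothetical constants.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z1 = sample_gaussian(mean1, std1, rng)
        z2 = sample_gaussian(mean2, std2, rng)
        dist = math.dist(z1, z2)
        # sigmoid(-a * dist + b) written as 1 / (1 + exp(a * dist - b))
        total += 1.0 / (1.0 + math.exp(a * dist - b))
    return total / n_samples


# Two confident embeddings at the same location.
p_close = match_probability([0.0, 0.0], [0.1, 0.1], [0.0, 0.0], [0.1, 0.1])
# Two confident embeddings far apart.
p_far = match_probability([0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [0.1, 0.1])
# Same location, but both embeddings are very uncertain (ambiguous inputs).
p_blurry = match_probability([0.0, 0.0], [2.0, 2.0], [0.0, 0.0], [2.0, 2.0])
print(p_close, p_far, p_blurry)
```

With a deterministic embedding the first and third pairs would receive the same score, since their means coincide. With the sampled version, the wide variances of the ambiguous pair lower its match score, which is one way a per-input uncertainty can feed into match confidence.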


Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

arXiv.org Machine Learning

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.