

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Columbia University, Google, Cornell University

Neural Information Processing Systems

We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks. We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Furthermore, we study a modality-agnostic, single-backbone Transformer by sharing weights among the three modalities. We show that the convolution-free VATT outperforms state-of-the-art ConvNet-based architectures in the downstream tasks. In particular, VATT's vision Transformer achieves top-1 accuracies of 82.1% on Kinetics-400, 83.6% on Kinetics-600, 72.7% on Kinetics-700, and 41.1% on Moments in Time, new records achieved without supervised pre-training. Transferring to image classification yields 78.7% top-1 accuracy on ImageNet, compared to 64.7% when training the same Transformer from scratch, showing the generalizability of our model despite the domain gap between videos and images. VATT's audio Transformer also sets a new record on waveform-based audio event recognition, achieving an mAP of 39.4% on AudioSet without any supervised pre-training. VATT's source code is publicly available.
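The multimodal contrastive objective above can be sketched as a symmetric noise-contrastive (InfoNCE) loss between paired embeddings from two modalities, where each clip's video/audio (or video/text) projections form the positive pair and all other items in the batch serve as negatives. The function name, batch shapes, and temperature value below are illustrative assumptions, not VATT's exact implementation:

```python
import numpy as np

def nce_loss(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two batches of modality embeddings.

    z_a, z_b: (batch, dim) projections of paired clips (e.g. video/audio).
    Positive pairs sit on the diagonal of the similarity matrix.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # L2-normalize
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                      # (batch, batch)

    def ce_diag(l):
        # cross-entropy with the diagonal entries as the targets
        l = l - l.max(axis=1, keepdims=True)                # numeric stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # contrast in both directions (a->b and b->a) and average
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))

rng = np.random.default_rng(0)
loss = nce_loss(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
```

With random, unaligned embeddings the loss sits near log(batch); training drives it down by pulling paired projections together in the common space.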


A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing

Neural Information Processing Systems

Recent developments of vision large language models (LLMs) have seen remarkable progress, yet still encounter challenges towards multimodal generalists, such as coarse-grained instance-level understanding, lack of unified support for both images and videos, and insufficient coverage across various vision tasks.



Graph Neural Flows for Unveiling Systemic Interactions Among Irregularly Sampled Time Series

Neural Information Processing Systems

Interacting systems are prevalent in nature. It is challenging to accurately predict the dynamics of the system if its constituent components are analyzed independently. We develop a graph-based model that unveils the systemic interactions of time series observed at irregular time points, by using a directed acyclic graph to model the conditional dependencies (a form of causal notation) of the system components and learning this graph in tandem with a continuous-time model that parameterizes the solution curves of ordinary differential equations (ODEs). Our technique, a graph neural flow, leads to substantial enhancements over non-graph-based methods, as well as graph-based methods without the modeling of conditional dependencies. We validate our approach on several tasks, including time series classification and forecasting, to demonstrate its efficacy.
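The core idea above, a DAG constraining which components may influence each other's continuous-time dynamics, can be sketched with a masked weight matrix and a simple Euler integrator between irregular observation times. The adjacency, weights, and the Euler solver below are illustrative stand-ins for the paper's learned graph and flow parameterization:

```python
import numpy as np

# A[i, j] = 1 if component j is a parent of component i (a DAG over 3 series)
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
W = 0.5 * A  # hypothetical learned interaction weights, masked by the DAG

def dxdt(x):
    # each component's derivative depends only on its DAG parents (plus decay)
    return np.tanh(W @ x) - 0.1 * x

def integrate(x0, t0, t1, steps=100):
    """Euler integration between two (possibly irregular) observation times."""
    x, dt = x0.copy(), (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * dxdt(x)
    return x

# evolve the state from an observation at t=0.0 to the next one at t=0.7
x1 = integrate(np.array([1.0, 0.0, 0.0]), t0=0.0, t1=0.7)
```

Because `t1 - t0` is a free parameter, the same model handles arbitrary gaps between samples, which is what makes the continuous-time formulation a fit for irregularly sampled series.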



Appendix: Supplementary Material
A Detailed Derivation of Formula 4

Neural Information Processing Systems

We state the PAC-Bayes theorem (Section 4), which bounds the generalization error of any posterior distribution Q over parameters reachable from the training set, given a prior distribution P over parameters that must be chosen in advance, before observing the training set. Letting Q and P be k-dimensional Gaussian distributions (Jiang et al., 2020), the KL-term admits the closed form KL(N(µ_Q, Σ_Q) ‖ N(µ_P, Σ_P)). Nevertheless, we have contributed theoretically to better capturing the true posterior by (1) relaxing an i.i.d. assumption. We recognize that our hypothetical covariance only captures the linear correlation between the weights of neurons (filters), so a gap remains between our hypothetical covariance and the true covariance. We also remark, however, that estimating the "true" posterior from data is itself problematic: using sharpness-like methods (Keskar et al., 2016) to sample parameters and estimate the covariance, for example, easily raises further questions about the accuracy of the estimate and leads to theoretically intractable derivations.
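For the diagonal-covariance special case the Gaussian KL-term above has a simple closed form, 0.5 Σᵢ [log(σ²_P,i / σ²_Q,i) + (σ²_Q,i + (µ_Q,i − µ_P,i)²) / σ²_P,i − 1], which can be computed directly. This is the standard formula for KL between Gaussians, not code from the paper; the variable names are illustrative:

```python
import numpy as np

def kl_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), closed form."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p
                        - 1.0)

# posterior Q shifted and sharpened relative to a standard-normal prior P
kl = kl_gaussians(np.array([0.5, -0.2]), np.array([1.0, 0.5]),
                  np.zeros(2), np.ones(2))
```

As a sanity check, the KL vanishes when Q equals P and is strictly positive otherwise, which is what makes it usable as a complexity penalty in the PAC-Bayes bound.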


How does Weight Correlation Affect the Generalisation Ability of Deep Neural Networks?

Neural Information Processing Systems

This paper studies the novel concept of weight correlation in deep neural networks and discusses its impact on the networks' generalisation ability. For fully-connected layers, the weight correlation is defined as the average cosine similarity between the weight vectors of neurons; for convolutional layers, it is defined as the cosine similarity between filter matrices. Theoretically, we show that weight correlation can, and should, be incorporated into the PAC-Bayesian framework for the generalisation of neural networks, and that the resulting generalisation bound is monotonic with respect to the weight correlation. We formulate a new complexity measure, which lifts the PAC-Bayes measure with weight correlation, and experimentally confirm that it ranks the generalisation errors of a set of networks more precisely than existing measures. More importantly, we develop a new regulariser for training and provide extensive experiments showing that our approach greatly reduces the generalisation error.
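The fully-connected-layer definition above is directly computable: average the cosine similarity over all distinct pairs of neuron weight vectors (rows of the layer's weight matrix). The sketch below is a minimal reading of that definition; taking the absolute value of each similarity is an assumption on my part (so anti-parallel vectors also count as correlated), and the function name is illustrative:

```python
import numpy as np

def fc_weight_correlation(W):
    """Average pairwise |cosine similarity| between rows of W.

    W: (neurons, inputs) weight matrix of a fully-connected layer;
    each row is one neuron's weight vector.
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-norm rows
    sims = np.abs(Wn @ Wn.T)                           # pairwise |cosine|
    n = W.shape[0]
    # average over distinct pairs only (exclude the diagonal of ones)
    return (sims.sum() - n) / (n * (n - 1))

corr_parallel = fc_weight_correlation(np.array([[1.0, 0.0], [2.0, 0.0]]))
corr_orthogonal = fc_weight_correlation(np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Parallel rows give a correlation of 1 and orthogonal rows give 0, matching the intuition that highly redundant neurons inflate the measure; for convolutional layers the same computation would apply to flattened filters.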


VPGTrans: Transfer Visual Prompt Generator across LLMs

Neural Information Processing Systems

Since developing a new multimodal LLM (MLLM) by pre-training on a tremendous amount of image-text pairs from scratch is exceedingly resource-intensive, connecting an existing LLM with a comparatively lightweight visual prompt generator (VPG) becomes a feasible paradigm.
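The LLM-plus-VPG paradigm can be sketched in a few lines: a frozen vision encoder's features are mapped by a small trainable module into the LLM's embedding space and prepended to the text embeddings as soft prompts. The single linear projection below is a deliberate simplification (real VPGs are typically small transformers such as a Q-Former), and all dimensions are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_llm = 512, 768      # hypothetical vision-feature and LLM embedding dims
n_patches, n_text = 16, 8    # visual tokens per image, text tokens per prompt

# hypothetical lightweight VPG: one linear map into the LLM embedding space;
# only this matrix would be trained (or transferred), the LLM stays frozen
W_vpg = 0.02 * rng.standard_normal((d_vis, d_llm))

vis_feats = rng.standard_normal((n_patches, d_vis))  # frozen vision encoder output
text_embs = rng.standard_normal((n_text, d_llm))     # frozen LLM token embeddings

soft_prompts = vis_feats @ W_vpg                     # (n_patches, d_llm)
# prepend the visual soft prompts to the text sequence fed to the LLM
llm_input = np.concatenate([soft_prompts, text_embs], axis=0)
```

Because the trainable surface is only the VPG, transferring it across LLMs (the VPGTrans setting) amounts to adapting this small module to a new target embedding space rather than re-doing full multimodal pre-training.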