
MAViL: Masked Audio-Video Learners, Po-Yao Huang, Chaitanya Ryali

Neural Information Processing Systems

We present Masked Audio-Video Learners (MAViL) to learn audio-visual representations with three complementary forms of self-supervision: (1) reconstructing masked raw audio and video inputs, (2) intra-modal and inter-modal contrastive learning with masking, and (3) self-training to predict aligned and contextualized audio-video representations learned from the first two objectives. Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. Notably, pre-training with MAViL not only enhances performance in multimodal classification and retrieval tasks, but it also improves the representations of each modality in isolation, without relying on information from the other modality during uni-modal fine-tuning or inference.
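
For intuition, here is a minimal sketch (not MAViL's actual architecture) of the inter-modal contrastive objective combined with masking: paired audio and video clips are encoded from a random subset of visible tokens, and a symmetric InfoNCE loss pulls matching pairs together while pushing mismatched pairs apart. The masking ratio, pooling, and embedding sizes below are illustrative stand-ins.

# Hedged sketch: symmetric InfoNCE between pooled audio and video embeddings,
# with random token masking as in masked pre-training. Illustrative only.
import torch
import torch.nn.functional as F

def info_nce(audio_emb, video_emb, tau=0.07):
    """Symmetric contrastive loss over a batch of paired audio/video embeddings."""
    a = F.normalize(audio_emb, dim=-1)          # (B, D)
    v = F.normalize(video_emb, dim=-1)          # (B, D)
    logits = a @ v.t() / tau                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matching audio/video clips are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def random_mask(tokens, mask_ratio=0.8):
    """Keep a random subset of tokens: (B, N, D) -> (B, N_keep, D)."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

# Toy usage with random features standing in for patch/token embeddings.
audio_tokens = torch.randn(4, 64, 256)
video_tokens = torch.randn(4, 128, 256)
a_emb = random_mask(audio_tokens).mean(dim=1)   # pooled visible-token embedding
v_emb = random_mask(video_tokens).mean(dim=1)
loss = info_nce(a_emb, v_emb)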



f50a6c02a3fc5a3a5d4d9391f05f3efc-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their feedback. We're glad that all reviewers agree that the paper is well-written and that side effect avoidance is an important AI safety problem. We will include results for DQN and for additional auxiliary reward functions. Unfortunately, neither approach is remotely viable in SafeLife. We estimate that there are billions of reachable states in any given SafeLife level. We share their interest in this prospect.


Supplementary Materials for MLP-Mixer: An all-MLP Architecture for Vision, Lucas Beyer

Neural Information Processing Systems

A.1 Modifying the token-mixing MLPs

We ablated a number of ideas trying to improve the token-mixing MLPs for Mixer models of various scales pre-trained on JFT-300M.

Untying the parameters: by default, a single token-mixing MLP is shared across all channels. Instead, we could introduce C separate MLPs with independent weights, effectively multiplying the number of parameters by C. We did not observe any noticeable improvements.

Grouping the channels together: token-mixing MLPs take S-dimensional vectors as inputs. Every such vector contains values of a single feature across S different spatial locations. In other words, token-mixing MLPs operate by looking at only one channel at a time.
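
As a reference point, here is a minimal sketch of a token-mixing MLP in this parameter-sharing form: the same MLP, acting across the S spatial locations, is applied to each of the C channels independently. The hidden width and activation are illustrative choices, not the exact Mixer configuration.

# Hedged sketch of a shared token-mixing MLP (mixes spatial locations,
# one channel at a time). Sizes and activation are illustrative.
import torch
import torch.nn as nn

class TokenMixingMLP(nn.Module):
    def __init__(self, num_tokens, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_tokens),
        )

    def forward(self, x):
        # x: (batch, S tokens, C channels). Transpose so the MLP sees, for each
        # channel, the S values of that single feature across spatial locations.
        y = x.transpose(1, 2)          # (batch, C, S)
        y = self.mlp(y)                # weights shared across all C channels
        return y.transpose(1, 2)       # back to (batch, S, C)

x = torch.randn(2, 196, 512)           # e.g. 196 patches, 512 channels
out = TokenMixingMLP(num_tokens=196, hidden_dim=256)(x)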



MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions

Neural Information Processing Systems

Deep learning models like AlphaFold2 (Jumper et al., 2021) have revolutionized protein structure prediction, achieving unprecedented accuracy. However, the dependence on robust multiple sequence alignments (MSAs) continues to pose a challenge, especially for proteins that lack a wealth of homologous sequences. To overcome this limitation, we introduce MSA-Generator, a self-supervised generative protein language model. Trained on a sequence-to-sequence task using an automatically constructed dataset, MSA-Generator employs protein-specific attention mechanisms to harness large-scale protein databases, generating virtual MSAs that enrich existing ones and boost prediction accuracy. Our experiments on CASP14 and CASP15 benchmarks reveal significant improvements in LDDT scores, particularly for complex and challenging sequences, enhancing the performance of both AlphaFold2 and RoseTTAFold. The code is released at https://github.com/lezhang7/MSAGen.
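
A minimal sketch of the enrich-then-predict idea described above, with a hypothetical generate_virtual_sequences placeholder standing in for the actual pretrained seq2seq generator (names and interface here are assumptions, not MSA-Generator's real API):

# Hedged sketch: a generative model proposes homolog-like sequences that are
# appended to a shallow MSA before structure prediction. Placeholder only.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_virtual_sequences(msa, n_new):
    # Placeholder: in the real system this would be a pretrained seq2seq
    # protein language model conditioned on the existing alignment.
    length = len(msa[0])
    return ["".join(random.choice(AMINO_ACIDS) for _ in range(length)) for _ in range(n_new)]

def enrich_msa(msa, n_new=16):
    """Return the original (shallow) MSA plus generated virtual sequences."""
    return msa + generate_virtual_sequences(msa, n_new)

shallow_msa = ["MKTAYIAKQR", "MKTAYLAKQR"]      # toy aligned sequences
enriched = enrich_msa(shallow_msa)
# `enriched` would then be passed to a structure predictor such as AlphaFold2.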


A Metalearned Neural Circuit for Nonparametric Bayesian Inference

Neural Information Processing Systems

Most applications of machine learning to classification assume a closed set of balanced classes. This is at odds with the real world, where class occurrence statistics often follow a long-tailed power-law distribution, rarely revealing the entire problem domain in a single sample. Nonparametric Bayesian models naturally capture this phenomenon, but have significant practical barriers to widespread adoption, namely implementation complexity and computational inefficiency. To address this, we present a method for extracting the inductive bias from a nonparametric Bayesian model and transferring it to an artificial neural network. By simulating data with a nonparametric Bayesian prior, we can metalearn a sequence model that performs inference over an unlimited set of classes. After training, this "neural circuit" has distilled the corresponding inductive bias and can successfully perform sequential inference over an open set of classes. Our experimental results show that the metalearned neural circuit achieves comparable or better performance than particle filter-based methods that explicitly perform Bayesian nonparametric inference while being faster and simpler to use.
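
As an illustration of the data-simulation step, here is a minimal sketch that samples label sequences from a Chinese Restaurant Process prior, one common nonparametric Bayesian prior; the concentration parameter and sequence length are illustrative, and pairing the labels with sampled observations would yield training data for the sequence model.

# Hedged sketch: sample class labels from a Chinese Restaurant Process prior.
import random

def sample_crp_labels(seq_len, alpha=1.0):
    """Sample a label sequence from a Chinese Restaurant Process."""
    labels = []
    counts = []                                  # customers per "table" (class)
    for n in range(seq_len):
        # A new class is opened with probability alpha / (n + alpha).
        if random.random() < alpha / (n + alpha):
            counts.append(1)
            labels.append(len(counts) - 1)
        else:
            # Otherwise join an existing class with probability proportional to its size.
            k = random.choices(range(len(counts)), weights=counts)[0]
            counts[k] += 1
            labels.append(k)
    return labels

# Each simulated sequence can introduce an unbounded number of classes.
print(sample_crp_labels(20))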


Personalized Federated Learning with Moreau Envelopes: Supplementary Materials, Nguyen H. Tran

Neural Information Processing Systems

In this appendix we provide proofs for the theorems and lemmas in the paper "Personalized Federated Learning with Moreau Envelopes", as well as additional experimental settings and results.
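
For convenience, the standard Moreau-envelope construction that the pFedMe objective builds on is restated below in generic notation (the paper's own notation may differ slightly): each client i minimizes a regularized proximal version of its loss f_i around the global model w, and the server averages the resulting envelopes.

% Standard Moreau-envelope construction (generic notation, for reference).
F_i(w) \;=\; \min_{\theta_i \in \mathbb{R}^d}
  \Big\{ f_i(\theta_i) + \frac{\lambda}{2}\,\lVert \theta_i - w \rVert^2 \Big\},
\qquad
\min_{w \in \mathbb{R}^d} \; F(w) \;=\; \frac{1}{N}\sum_{i=1}^{N} F_i(w).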



Generative Neural Fields by Mixtures of Neural Implicit Functions

Neural Information Processing Systems

We propose a novel approach to learning generative neural fields represented by linear combinations of implicit basis networks. Our algorithm learns the basis networks, in the form of implicit neural representations, and their coefficients in a latent space, by either meta-learning or an auto-decoding paradigm. The proposed method easily enlarges the capacity of generative neural fields by increasing the number of basis networks, while keeping the network used at inference small through weighted model averaging. Consequently, sampling instances from the model is efficient in terms of latency and memory footprint. Moreover, we customize a denoising diffusion probabilistic model for a target task to sample latent mixture coefficients, which allows our final model to generate unseen data effectively. Experiments show that our approach achieves competitive generation performance on diverse benchmarks for images, voxel data, and NeRF scenes without sophisticated designs for specific modalities and domains.
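
A minimal sketch of the weighted-model-averaging idea described above: K basis networks with identical architecture are collapsed into a single small network by averaging their parameters with per-instance mixture coefficients, so inference runs only one network. The layer sizes, K, and the softmax over coefficients are illustrative assumptions, not the paper's exact configuration.

# Hedged sketch: collapse K basis implicit networks into one network via
# coefficient-weighted parameter averaging. Illustrative sizes only.
import torch
import torch.nn as nn

def make_basis_net():
    return nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 3))  # coords -> RGB

basis_nets = [make_basis_net() for _ in range(8)]            # K = 8 basis networks

def mix_networks(nets, coeffs):
    """Build one network whose weights are the coefficient-weighted average of the basis nets."""
    mixed = make_basis_net()
    with torch.no_grad():
        for name, param in mixed.named_parameters():
            stacked = torch.stack([dict(n.named_parameters())[name] for n in nets])  # (K, ...)
            w = coeffs.view(-1, *([1] * (stacked.dim() - 1)))  # broadcast coefficients
            param.copy_((w * stacked).sum(dim=0))
    return mixed

coeffs = torch.softmax(torch.randn(8), dim=0)                 # per-instance mixture coefficients
field = mix_networks(basis_nets, coeffs)
rgb = field(torch.rand(1024, 2))                              # query the neural field at 2-D coords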