Supplementary Materials for MAViL: Masked Audio-Video Learners

Neural Information Processing Systems 

These results are obtained using the stage-1 MAViL's decoders. In Section D, we discuss MAViL's societal impact and limitations.

Figure 1: Video clip and spectrogram reconstruction on the AudioSet eval set. We sample 4 paired (video, audio) examples as follows. Top left: a puppy video; top right: a recording from an ambulance's dash camera; bottom left: a person dialing a phone in a dark room; bottom right: a singer dancing. In each 3-row group, we show the original video and its audio spectrogram (top), the masked input to MAViL (middle), and MAViL's video and audio spectrogram reconstructions (bottom). The spectrogram shape is 1024 × 128; the patch size is 16 × 16.
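To make the shapes in the Figure 1 caption concrete, the following is a minimal sketch of how a 1024 × 128 spectrogram splits into non-overlapping 16 × 16 patches, giving a 64 × 8 grid of 512 patches. It is an illustration and not MAViL's actual implementation: the helper names `patchify` and `random_mask`, the uniform random masking strategy, and the 0.8 mask ratio are all assumptions chosen for the example, and the spectrogram is random noise.

```python
import numpy as np


def patchify(spec, patch=16):
    """Split a (time, freq) spectrogram into flattened non-overlapping patches."""
    t, f = spec.shape
    assert t % patch == 0 and f % patch == 0, "shape must be divisible by patch size"
    gt, gf = t // patch, f // patch  # patch grid: 64 x 8 for a 1024 x 128 input
    x = spec.reshape(gt, patch, gf, patch).transpose(0, 2, 1, 3)
    return x.reshape(gt * gf, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask where True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask


# Placeholder input: random values standing in for a real log-mel spectrogram.
spec = np.random.default_rng(0).standard_normal((1024, 128))
patches = patchify(spec)  # shape (512, 256)

# Assumed mask ratio of 0.8, used here only for illustration.
mask = random_mask(len(patches), 0.8, np.random.default_rng(1))
visible = patches[~mask]  # the patches an encoder would see in this sketch
```

In this sketch, the middle row of each group in Figure 1 corresponds to the spectrogram with the masked patches blanked out, and the bottom row to a decoder's prediction for those patches.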
