Extending Video Masked Autoencoders to 128 frames

Neural Information Processing Systems 

Video understanding has seen significant progress: recent video foundation models demonstrate strong performance owing to self-supervised pre-training objectives, with Masked Autoencoders (MAE) being the design of choice.