Appendix
Neural Information Processing Systems
In this appendix, we provide more details of VideoMAE from the following aspects: the detailed architecture illustration is in A, the implementation details are in B, results analysis is in D, visualization of reconstructed samples is in E, and the license of the datasets is in F. We take the 16-frame vanilla ViT-Base as an example. We conduct the experiments with 64 GPUs for both pre-training and fine-tuning on the Something-Something V2 and Kinetics-400 datasets. The experiments on the AVA dataset are conducted with 32 GPUs. For evaluation, all models share the same inference protocol, i.e., 2 clips ... Our VideoMAE is pre-trained for 800 epochs on Kinetics-400 by default. We follow a similar recipe on Kinetics for pre-training.
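To make the shared inference protocol more concrete, the following is a minimal sketch of multi-clip test-time aggregation: each clip sampled from a video is scored independently, and the per-clip class probabilities are averaged into one video-level prediction. The function names `softmax` and `aggregate_views`, and the choice to average softmax probabilities rather than raw logits, are assumptions made for illustration; the text above does not specify how VideoMAE combines its clips.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def aggregate_views(view_logits):
    """Average per-view class probabilities into one video-level score.

    view_logits: array-like of shape (num_views, num_classes), one row per
    clip sampled from the same video. Averaging probabilities (not logits)
    is an assumption of this sketch.
    """
    probs = softmax(np.asarray(view_logits, dtype=np.float64), axis=-1)
    return probs.mean(axis=0)


# Hypothetical logits for 2 clips of one video over 3 classes.
clip_logits = [[2.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]]
video_scores = aggregate_views(clip_logits)
predicted_class = int(video_scores.argmax())
```

Each row of `clip_logits` stands for one clip, so adding spatial crops would only add rows; the aggregation itself is unchanged.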
Aug-14-2025, 10:19:19 GMT