Appendix

Aug-14-2025, 10:19:19 GMT–Neural Information Processing Systems

In this appendix, we provide more details of VideoMAE from the following aspects: the detailed architecture illustration is in A, the implementation details are in B, the results analysis is in D, the visualization of reconstructed samples is in E, and the license of the datasets is in F. We take the 16-frame vanilla ViT-Base as an example. We conduct the experiments with 64 GPUs for both pre-training and fine-tuning on the Something-Something V2 and Kinetics-400 datasets. The experiments on the AVA dataset are conducted with 32 GPUs. For evaluation, all models share the same inference protocol, i.e., 2 clips ... Our VideoMAE is pre-trained for 800 epochs on Kinetics-400 by default. We follow a similar recipe on Kinetics for pre-training.




