TokenLearner: Adaptive Space-Time Tokenization for Videos - Supplementary Materials - Michael S. Ryoo
–Neural Information Processing Systems
We train the Kinetics model for 30 epochs with a base learning rate of 0.05 using the Momentum optimizer. All settings in our Kinetics experiments essentially follow those of ViViT. We provide the training details below. We use a cosine-decay learning rate schedule, which has been widely used for training video CNN models. A base learning rate of 0.8 per TPU core (equivalent to a single GPU) was used for the Charades dataset (multi-label action classification), and a base rate of 0.025 per TPU was used for AViD. Label smoothing of 0.2 was used for AViD training. For Charades, training was done by temporally cropping long Charades videos (e.g., ...). In all these experiments, the ViT L/16 model was used.
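The cosine-decay schedule and label smoothing mentioned above can be sketched in plain Python. This is a minimal illustration, not the authors' training code: it assumes the rate decays to zero with no warmup (the text states neither), uses the common uniform form of label smoothing, and the `steps_per_epoch` value and both function names are hypothetical.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr):
    """Cosine-decayed learning rate: base_lr at step 0, 0.0 at total_steps.

    Assumption: no warmup and a final rate of zero; the text specifies neither.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def smooth_labels(one_hot, smoothing):
    """Uniform label smoothing: blend a one-hot target with a uniform distribution."""
    num_classes = len(one_hot)
    return [(1.0 - smoothing) * y + smoothing / num_classes for y in one_hot]


# Values quoted in the text: base rate 0.05 and 30 epochs (Kinetics),
# label smoothing 0.2 (AViD). steps_per_epoch is a hypothetical placeholder.
steps_per_epoch = 1000
total_steps = 30 * steps_per_epoch

lr_start = cosine_decay_lr(0, total_steps, 0.05)
lr_mid = cosine_decay_lr(total_steps // 2, total_steps, 0.05)
lr_end = cosine_decay_lr(total_steps, total_steps, 0.05)

# Four-class toy target with the true class at index 1.
targets = smooth_labels([0.0, 1.0, 0.0, 0.0], 0.2)
```

Under these assumptions the rate starts at the base value, reaches about half of it midway through training, and ends at zero, while the smoothed target keeps most of its mass on the true class and spreads the remainder evenly over all classes.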
Aug-14-2025, 23:53:24 GMT