TokenLearner: Adaptive Space-Time Tokenization for Videos - Supplementary Materials - Michael S. Ryoo

Neural Information Processing Systems 

Basically, all the settings in our Kinetics experiments follow the settings of ViViT. We provide the training details below. We train the Kinetics model for 30 epochs with a base learning rate of 0.05 with the Momentum optimizer. We use the cosine-decay learning rate schedule, which has been popularly used in many video CNN model trainings. A base learning rate of 0.8 per TPU core (equivalent to a single GPU) is used for the Charades dataset (multi-label action classification), and a base rate of 0.025 per TPU core was used for AViD. Label smoothing of 0.2 was used for the AViD training. In Charades, the training was done by temporally cropping long Charades videos (e.g., …). In all these experiments, the ViT L/16 model was used.
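As a minimal sketch of the cosine-decay schedule described above, the learning rate at a given step can be computed as follows. The function name and the plain cosine form (no warmup phase) are assumptions for illustration; the actual training code may include a linear warmup before the decay.

```python
import math

def cosine_decay_lr(base_lr, step, total_steps):
    """Cosine-decay schedule: starts at base_lr and decays to 0
    over total_steps following half a cosine curve."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# With the Kinetics base rate of 0.05, the rate is 0.05 at step 0,
# 0.025 at the halfway point, and approaches 0 at the final step.
```

The per-core base rates quoted in the text (0.8 for Charades, 0.025 for AViD) would simply replace `base_lr` here, typically after scaling by the number of cores when using data parallelism.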