Supplementary Materials for V
Neural Information Processing Systems
In this appendix, we start by describing the experimental setup details (Sec.

Each sampled video from 30K has around 100 clips on average.

To investigate whether the additional MLP distillation head (Sec. 3.3 in the main paper) affects the

As we see in Table 1, for both NST and CRD, performance drops on all downstream tasks when the distillation heads are removed.

Table 1: Ablation results of additional distillation heads for student language models.
SST-2 QNLI QQP MNLI BERT

In Table 2, we compare the accuracy of text-only pretraining, image-based KD, and video-based KD on PIQA. KD further improves the results.
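To make the distillation-head ablation concrete, the following is a minimal sketch of what "with" and "without" a head could look like. It is an illustration under stated assumptions, not the implementation from the paper: the class `DistillationHead`, the two-layer ReLU design, the layer sizes, and the plain mean-squared-error objective are all hypothetical stand-ins (the paper's objectives are NST and CRD, which are not reproduced here).

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a linear layer with small random weights."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a linear layer to a single feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class DistillationHead:
    """Hypothetical two-layer MLP mapping student features to the teacher's size.

    Assumption: the head is used only while distilling and is discarded
    before downstream fine-tuning.
    """

    def __init__(self, student_dim, hidden_dim, teacher_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(student_dim, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, teacher_dim, rng)

    def __call__(self, student_feature):
        return linear(relu(linear(student_feature, self.fc1)), self.fc2)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feature, teacher_feature, head=None):
    """Stand-in distillation loss (MSE, not NST or CRD).

    With a head, the student feature is projected first. Without one, the
    raw student feature is compared directly, so the sizes must already match.
    """
    projected = head(student_feature) if head is not None else student_feature
    if len(projected) != len(teacher_feature):
        raise ValueError("feature sizes differ; a projection head is needed")
    return mse(projected, teacher_feature)


# Usage with made-up feature vectors (student size 4, teacher size 6).
head = DistillationHead(student_dim=4, hidden_dim=8, teacher_dim=6)
student = [0.1, -0.2, 0.3, 0.4]
teacher = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
loss_with_head = distillation_loss(student, teacher, head)
```

In this sketch, removing the head forces the loss to act directly on the student's own representation, which is one plausible reading of the "heads removed" condition in the ablation.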
Nov-15-2025, 17:22:00 GMT