Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation

Open in new window