Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation
Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian
Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treats this as a vision-modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art for the video action anticipation task.

Figure 1: A model learning action anticipation from only the vision modality (video frames) is essentially exposed to a very limited set of action sequences. Language models, which are pre-trained on large-scale text, can learn this distribution from the task and from much larger domain-relevant text. We propose distilling this knowledge from text-modality models to the vision-modality model.
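The abstract only names "a simple distillation technique," so the exact formulation is not given here. The sketch below shows one common way such cross-modal distillation is set up: a vision student (e.g., an AVT-style video model) is trained with cross-entropy on the ground-truth next action plus a KL term that pulls its next-action distribution toward that of a frozen text-based teacher. The function names, the `alpha`/`temperature` hyperparameters, and the training-step wiring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=2.0):
    """Combine ground-truth cross-entropy with a soft KL term that matches the
    text teacher's next-action distribution. alpha and temperature are
    illustrative hyperparameters, not values from the paper."""
    ce = F.cross_entropy(student_logits, target)
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return (1.0 - alpha) * ce + alpha * kl

def train_step(vision_model, text_teacher, optimizer,
               video_clips, action_history_ids, target_actions):
    # Hypothetical training step: vision_model is the student (a video backbone
    # predicting the next action); text_teacher is a frozen language model that
    # scores the next action given the preceding action-label sequence.
    student_logits = vision_model(video_clips)              # (B, num_actions)
    with torch.no_grad():
        teacher_logits = text_teacher(action_history_ids)   # (B, num_actions)
    loss = distillation_loss(student_logits, teacher_logits, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```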
arXiv.org Artificial Intelligence
Feb-21-2023
- Country:
  - North America > United States > Minnesota (0.28)
- Genre:
  - Research Report (0.50)
  - Workflow (0.58)
- Industry:
  - Education (0.47)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.46)
    - Natural Language (1.00)
    - Vision (1.00)