Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation
Sayontan Ghosh, Tanvi Aggarwal, Minh Hoai, Niranjan Balasubramanian
Anticipating future actions in a video is useful for many autonomous and assistive technologies. Most prior action anticipation work treats this as a vision-modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision-based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art for the video action anticipation task.

Figure 1: A model learning action anticipation from only the vision modality (video frames) is essentially exposed to a very limited set of action sequences. Language models, which are pre-trained on large-scale text, can learn this distribution from the task and from much larger domain-relevant text. We propose distilling this knowledge from text-modality models to the vision-modality model.
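The abstract only names "a simple distillation technique," so the exact formulation is not given here. The sketch below shows one common way such cross-modal distillation is set up: a vision student (e.g., an AVT-style video model) is trained with cross-entropy on the ground-truth next action plus a KL term that pulls its next-action distribution toward that of a frozen text-based teacher. The function names, the `alpha`/`temperature` hyperparameters, and the training-step wiring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=2.0):
    """Combine ground-truth cross-entropy with a soft KL term that matches the
    text teacher's next-action distribution. alpha and temperature are
    illustrative hyperparameters, not values from the paper."""
    ce = F.cross_entropy(student_logits, target)
    t = temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
    return (1.0 - alpha) * ce + alpha * kl

def train_step(vision_model, text_teacher, optimizer,
               video_clips, action_history_ids, target_actions):
    # Hypothetical training step: vision_model is the student (a video backbone
    # predicting the next action); text_teacher is a frozen language model that
    # scores the next action given the preceding action-label sequence.
    student_logits = vision_model(video_clips)              # (B, num_actions)
    with torch.no_grad():
        teacher_logits = text_teacher(action_history_ids)   # (B, num_actions)
    loss = distillation_loss(student_logits, teacher_logits, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```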
arXiv.org Artificial Intelligence
Feb-21-2023
- Country:
  - North America > United States > Minnesota (0.28)
- Genre:
  - Research Report (0.50)
  - Workflow (0.58)
- Industry:
  - Education (0.47)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning > Neural Networks (0.46)
    - Natural Language (1.00)
    - Vision (1.00)