Shifted Chunk Transformer for Spatio-Temporal Representational Learning

Oct-10-2024, 16:15:13 GMT–Neural Information Processing Systems

Spatio-temporal representational learning has been widely adopted in various fields such as action recognition, video object segmentation, and action anticipation.Previous spatio-temporal representational learning approaches primarily employ ConvNets or sequential models, e.g., LSTM, to learn the intra-frame and inter-frame features. Recently, Transformer models have successfully dominated the study of natural language processing (NLP), image classification, etc. However, the pure-Transformer based spatio-temporal learning can be prohibitively costly on memory and computation to extract fine-grained features from a tiny patch. To tackle the training difficulty and enhance the spatio-temporal learning, we construct a shifted chunk Transformer with pure self-attention blocks. Leveraging the recent efficient Transformer design in NLP, this shifted chunk Transformer can learn hierarchical spatio-temporal features from a local tiny patch to a global videoclip.

shifted chunk transformer, spatio-temporal representational learning, transformer, (2 more...)

Neural Information Processing Systems

Oct-10-2024, 16:15:13 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.87)