ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Aghdam, Amir, Hu, Vincent Tao, Ommer, Björn

Oct-21-2025–arXiv.org Artificial Intelligence

We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
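The alignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses textbook DTW over cosine distances, and the function names (`cosine_cost`, `dtw_cost`, `classify`), the length normalization by `T + K`, and the toy one-hot embeddings standing in for image-language model features are all assumptions made for illustration.

```python
import numpy as np

def cosine_cost(frames, steps):
    """Cosine distance between every frame and every sub-action embedding.

    frames: (T, d) array of frame embeddings (assumed to come from an image encoder).
    steps:  (K, d) array of sub-action embeddings (assumed to come from a text encoder).
    """
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    s = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    return 1.0 - f @ s.T  # shape (T, K)

def dtw_cost(cost):
    """Classic DTW: cheapest monotonic path from (0, 0) to (T-1, K-1)."""
    T, K = cost.shape
    D = np.full((T + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            D[t, k] = cost[t - 1, k - 1] + min(
                D[t - 1, k],      # next frame, same sub-action
                D[t, k - 1],      # same frame, next sub-action
                D[t - 1, k - 1],  # advance both
            )
    # Dividing by T + K is an assumed normalization so that classes with
    # different numbers of sub-actions are comparable.
    return D[T, K] / (T + K)

def classify(frames, class_steps):
    """Pick the class whose ordered sub-action sequence aligns most cheaply."""
    scores = {name: dtw_cost(cosine_cost(frames, steps))
              for name, steps in class_steps.items()}
    return min(scores, key=scores.get), scores

# Toy data: three orthogonal "sub-action" directions, each shown for two frames.
e = np.eye(3)
frames = np.repeat(e, 2, axis=0)                    # order: e1, e1, e2, e2, e3, e3
class_steps = {"forward": e, "reversed": e[::-1]}   # same sub-actions, opposite order
label, scores = classify(frames, class_steps)
```

In this toy case the two classes contain identical sub-actions and differ only in their order, so an order-agnostic pooling of frame-text similarities could not separate them, whereas the alignment cost can. In a real pipeline the embeddings would come from a pretrained image-language model and the sub-action texts from an LLM, as the abstract describes.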

large language model, machine learning, recognition



Country:
- Europe (0.68)
- North America > United States (0.46)

Genre:
- Research Report (0.50)

Industry:
- Leisure & Entertainment > Sports (0.68)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.61)
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Video Understanding (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
