ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment
Amir Aghdam, Vincent Tao Hu, Björn Ommer
arXiv.org Artificial Intelligence
We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
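The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes frame embeddings and sub-action text embeddings have already been produced by an image-language model and live in the same space; the function names `dtw_cost` and `classify`, the cosine-distance cost, the three-move DTW recursion, and the normalization by `T + M` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def dtw_cost(frames, steps):
    """Classic DTW cost between T frame embeddings and M sub-action embeddings.

    Assumption: local cost is 1 - cosine similarity, and the total is divided
    by T + M so classes with different numbers of sub-actions are comparable.
    """
    f = l2_normalize(np.asarray(frames, dtype=float))
    s = l2_normalize(np.asarray(steps, dtype=float))
    cost = 1.0 - f @ s.T  # shape (T, M)
    T, M = cost.shape

    # acc[i, j] = cheapest monotone alignment of the first i frames
    # with the first j sub-actions.
    acc = np.full((T + 1, M + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, M + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both frame and sub-action
                acc[i - 1, j],      # next frame, same sub-action
                acc[i, j - 1],      # same frame, next sub-action
            )
    return acc[T, M] / (T + M)


def classify(frames, class_steps):
    """Pick the class whose ordered sub-action sequence aligns most cheaply.

    class_steps is a hypothetical dict: class name -> (M, D) array of
    sub-action embeddings in temporal order.
    """
    costs = {name: dtw_cost(frames, steps) for name, steps in class_steps.items()}
    return min(costs, key=costs.get), costs
```

Because DTW only permits monotone paths, a class whose sub-actions appear in the wrong order is penalized even when every individual sub-action matches some frame, which is the temporal signal a per-frame image-language score alone would miss.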
Oct-21-2025
- Country:
- Europe (0.68)
- North America > United States (0.46)
- Genre:
- Research Report (0.50)
- Industry:
- Leisure & Entertainment > Sports (0.68)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.61)
- Education (0.46)
- Technology: