Training-free Online Video Step Grounding

Jun-19-2026, 17:39:17 GMT – Neural Information Processing Systems

Given a task and the set of steps that compose it, Video Step Grounding (VSG) aims to detect which steps are performed in a video. Standard approaches to this task require a labeled training set (e.g., with step-level annotations or narrations), which can be costly to collect. Moreover, they process the full video offline, which limits their applicability in scenarios that require online decisions. In this work, we therefore explore how to perform VSG online and without training. We achieve this by exploiting the zero-shot capabilities of recent Large Multimodal Models (LMMs).
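
To make the online, training-free setting concrete, the following is a minimal sketch and not the authors' method. It assumes the video arrives as a stream of clips and that some zero-shot scorer rates each clip against every step. The names `ground_steps_online`, `toy_scorer`, and `threshold` are hypothetical, and the toy scorer uses word overlap on text captions purely as a stand-in for an LMM query.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical scorer type: given one clip and the step list, return one
# score per step. In practice this would wrap a zero-shot LMM query.
Scorer = Callable[[str, List[str]], List[float]]


def toy_scorer(clip: str, steps: List[str]) -> List[float]:
    """Stand-in for an LMM: word overlap between a clip caption and each step."""
    clip_words = set(clip.lower().split())
    scores = []
    for step in steps:
        step_words = set(step.lower().split())
        scores.append(len(clip_words & step_words) / max(len(step_words), 1))
    return scores


def ground_steps_online(
    clips: Iterable[str],
    steps: List[str],
    scorer: Scorer = toy_scorer,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Iterator[Tuple[int, Optional[str]]]:
    """Yield (clip_index, detected_step_or_None) as each clip arrives."""
    for index, clip in enumerate(clips):
        scores = scorer(clip, steps)
        best = max(range(len(steps)), key=lambda i: scores[i])
        # Decide immediately from the current clip only; no future clips
        # are consulted, and nothing is trained or fine-tuned.
        detected = steps[best] if scores[best] >= threshold else None
        yield index, detected


if __name__ == "__main__":
    steps = ["crack the eggs", "whisk the eggs", "pour into pan"]
    stream = [
        "person talks to camera",
        "hands crack the eggs",
        "pour mixture into pan",
    ]
    for index, step in ground_steps_online(stream, steps):
        print(index, step)
```

Because the function is a generator, a decision for each clip is available before the next clip is read, which is the property that distinguishes the online setting from offline processing of the full video. Swapping `toy_scorer` for a real model call would leave the loop unchanged.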

Country:
- Europe (0.28)

Genre:
- Workflow (1.00)
- Research Report
  - Experimental Study (1.00)
  - New Finding (0.93)

Industry:
- Education > Educational Setting > Online (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.69)
    - Performance Analysis > Accuracy (0.46)
