VideoOrion: Tokenizing Object Dynamics in Videos

Feng, Yicheng, Li, Yijiang, Zhang, Wanpeng, Zheng, Sipeng, Lu, Zongqing

Nov-25-2024–arXiv.org Artificial Intelligence

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
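To make the idea of "aggregating spatial-temporal object features" into an object token concrete, here is a minimal, dependency-free sketch. The abstract does not specify the aggregation operator, so this assumes masked average pooling over space followed by averaging over time; that choice is illustrative and not necessarily what VideoOrion does. The function names `masked_mean` and `object_token` are hypothetical, and the per-frame masks are assumed to come from some detect-segment-track pipeline that is not shown.

```python
from typing import List, Optional

Feature = List[float]          # feature vector at one spatial location
FrameFeatures = List[Feature]  # all spatial locations of one frame
Mask = List[int]               # 0/1 flag per location for one tracked object


def masked_mean(features: FrameFeatures, mask: Mask) -> Optional[Feature]:
    """Average the feature vectors where mask == 1; None if the object is absent."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        return None
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def object_token(video: List[FrameFeatures], masks: List[Mask]) -> Feature:
    """Pool one tracked object's features over space, then over time.

    Assumption: simple averaging stands in for the paper's aggregation step.
    """
    per_frame = [masked_mean(f, m) for f, m in zip(video, masks)]
    per_frame = [p for p in per_frame if p is not None]  # skip frames without the object
    if not per_frame:
        raise ValueError("object never appears in the video")
    dim = len(per_frame[0])
    return [sum(p[d] for p in per_frame) / len(per_frame) for d in range(dim)]


# Toy input: 2 frames, 3 spatial locations each, 2-dimensional features.
video = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]],
    [[0.0, 0.0], [4.0, 6.0], [2.0, 2.0]],
]
masks = [
    [1, 1, 0],  # object covers locations 0 and 1 in frame 0
    [0, 1, 0],  # object covers location 1 in frame 1
]
token = object_token(video, masks)
```

Under these assumptions, each tracked object yields one fixed-size vector regardless of video length, which is the sense in which object tokens give a compact representation.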

Country:
- North America > United States
  - California > San Diego County > San Diego (0.04)
- Asia > China
  - Beijing > Beijing (0.04)

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.96)
