VideoOrion: Tokenizing Object Dynamics in Videos
Feng, Yicheng, Li, Yijiang, Zhang, Wanpeng, Zheng, Sipeng, Lu, Zongqing
arXiv.org Artificial Intelligence
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
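The abstract says object tokens are formed by aggregating spatial-temporal object features but does not say how. The sketch below illustrates the general idea only, assuming plain average pooling of per-patch features under each tracked object's masks across all frames. The function names (`pool_object_token`, `tokenize_objects`), the data layout, and the pooling rule are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def pool_object_token(
    features: Sequence[Sequence[Vector]],
    masks: Sequence[Sequence[int]],
) -> Vector:
    """Average the patch features covered by one object's masks over all frames.

    features[t][p] is the feature vector of patch p in frame t.
    masks[t][p] is 1 if the tracked object covers patch p in frame t, else 0.
    Assumption: plain mean pooling stands in for the unspecified aggregation.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for frame_feats, frame_mask in zip(features, masks):
        for vec, covered in zip(frame_feats, frame_mask):
            if covered:
                for i, value in enumerate(vec):
                    total[i] += value
                count += 1
    if count == 0:
        # The object never appears; return a zero vector as a placeholder.
        return total
    return [s / count for s in total]


def tokenize_objects(
    features: Sequence[Sequence[Vector]],
    object_masks: Sequence[Sequence[Sequence[int]]],
) -> List[Vector]:
    """Return one pooled token per tracked object."""
    return [pool_object_token(features, m) for m in object_masks]


# Toy input: 2 frames, 3 patches per frame, 2-dimensional features.
features = [
    [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0]],
    [[5.0, 4.0], [0.0, 0.0], [2.0, 2.0]],
]
object_masks = [
    [[1, 1, 0], [1, 0, 0]],  # object 0: patches 0 and 1 in frame 0, patch 0 in frame 1
    [[0, 0, 1], [0, 0, 1]],  # object 1: patch 2 in both frames
]
tokens = tokenize_objects(features, object_masks)
print(tokens)  # [[3.0, 2.0], [1.0, 1.0]]
```

Each object yields a single vector regardless of video length, which is the sense in which such tokens are compact: the token count scales with the number of tracked objects, not with the number of frames or patches.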
Nov-25-2024
- Country:
  - Africa > Angola
    - Namibe Province > South Atlantic Ocean (0.04)
  - Asia > China
  - North America > United States
    - California > San Diego County > San Diego (0.04)
- Genre:
  - Research Report (0.82)
- Technology: