VideoOrion: Tokenizing Object Dynamics in Videos

Feng, Yicheng, Li, Yijiang, Zhang, Wanpeng, Zheng, Sipeng, Lu, Zongqing

arXiv.org Artificial Intelligence 

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
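As a rough illustration of aggregating spatial-temporal object features into object tokens, the sketch below averages patch features over every frame and patch covered by a tracked object's mask. The abstract does not specify the aggregation operator, so mask-averaged pooling is an assumption here, and the function name `pool_object_tokens` and the nested-list data layout are hypothetical. The masks stand in for the output of the detect-segment-track pipeline, which is not implemented.

```python
from typing import List, Optional

Vector = List[float]


def pool_object_tokens(
    features: List[List[Vector]],
    masks: List[List[List[int]]],
) -> List[Optional[Vector]]:
    """Average patch features under each tracked object's mask.

    features[t][p] is the feature vector of patch p in frame t.
    masks[o][t][p] is 1 if patch p of frame t belongs to object o, else 0.
    Returns one pooled vector per object, or None for an object whose
    mask is empty in every frame (an assumed convention for this sketch).
    """
    dim = len(features[0][0])
    tokens: List[Optional[Vector]] = []
    for obj_mask in masks:
        total = [0.0] * dim
        count = 0
        # Walk over frames (time) and patches (space) together.
        for frame_feats, frame_mask in zip(features, obj_mask):
            for vec, inside in zip(frame_feats, frame_mask):
                if inside:
                    for d in range(dim):
                        total[d] += vec[d]
                    count += 1
        tokens.append([x / count for x in total] if count else None)
    return tokens


# Toy input: 2 frames, 3 patches per frame, 2-dimensional features.
features = [
    [[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]],
    [[1.0, 4.0], [8.0, 4.0], [9.0, 9.0]],
]
masks = [
    [[1, 1, 0], [0, 1, 0]],  # object 0 moves across patches over time
    [[0, 0, 1], [0, 0, 1]],  # object 1 stays in the last patch
    [[0, 0, 0], [0, 0, 0]],  # object 2 is never visible
]
tokens = pool_object_tokens(features, masks)
print(tokens)
```

Each object yields a single vector regardless of how many frames or patches it spans, which is what makes this kind of representation compact. In a real system a learned projection would presumably map the pooled vector into the LLM's embedding space; that step is omitted here.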