UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, Yong Li
arXiv.org Artificial Intelligence
Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban 3D space remain underexplored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips, and we then design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning correlates strongly with recall, perception, and navigation, while counterfactual and associative reasoning correlate less with the other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
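As a rough illustration of the correlation analysis described in the abstract, the sketch below computes pairwise Pearson correlations between per-task accuracies across a set of evaluated models. The task names and accuracy values are hypothetical placeholders for illustration only, not the paper's reported numbers.

```python
# Minimal sketch of a task-correlation analysis, assuming we have a
# per-task accuracy score for each evaluated Video-LLM. All values below
# are made up; the paper's actual scores and task taxonomy may differ.
import numpy as np

tasks = ["recall", "perception", "causal", "counterfactual", "associative", "navigation"]

# Rows: models (the benchmark evaluates 17 Video-LLMs; four shown here);
# columns: accuracy per task category in [0, 1].
accuracy = np.array([
    [0.42, 0.51, 0.38, 0.30, 0.33, 0.28],
    [0.55, 0.60, 0.49, 0.31, 0.35, 0.40],
    [0.48, 0.57, 0.44, 0.29, 0.40, 0.36],
    [0.60, 0.65, 0.55, 0.34, 0.37, 0.45],
])

# np.corrcoef treats each row as one variable, so transpose to get
# one row per task; the result is a (num_tasks x num_tasks) matrix.
corr = np.corrcoef(accuracy.T)

# Print each unique task pair once with its correlation coefficient.
for i, a in enumerate(tasks):
    for j in range(i + 1, len(tasks)):
        print(f"{a:>14} vs {tasks[j]:<14} r = {corr[i, j]:+.2f}")
```

A high pairwise coefficient (as the paper reports for causal reasoning against recall, perception, and navigation) suggests models that do well on one task tend to do well on the other.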
Mar-8-2025