Spatial Task


SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

Chen, Pingyi, Lou, Yujing, Cao, Shen, Guo, Jinhui, Fan, Lubin, Wu, Yue, Yang, Lin, Ma, Lizhuang, Ye, Jieping

arXiv.org Artificial Intelligence

While vision language models (VLMs) excel at 2D semantic visual understanding, their ability to reason quantitatively about 3D spatial relationships remains under-explored, largely because 2D images convey limited spatial information. In this paper, we analyze the problems that hinder VLMs' spatial understanding and propose SD-VLM, a novel framework that significantly enhances the fundamental spatial perception abilities of VLMs through two key contributions: (1) the Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) a simple depth positional encoding method that strengthens VLMs' spatial awareness. The MSMU dataset covers a broad range of quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM that shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also generalizes to other spatial understanding benchmarks, including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56%, respectively, on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
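
The abstract does not spell out the form of the depth positional encoding, so the snippet below is only a minimal sketch of one plausible reading: an embedding of each patch's estimated metric depth is added to its visual token before the tokens reach the language model. The module, parameter, and variable names (DepthPositionalEncoding, max_depth_m, patch_depth_m) are hypothetical and are not the released SD-VLM code.

import torch
import torch.nn as nn

class DepthPositionalEncoding(nn.Module):
    """Hypothetical sketch: inject per-patch depth into visual tokens.

    Assumes each image patch token comes with a metric depth value (e.g. the
    median depth of the patch from a monocular depth estimator). This is an
    illustration of depth-encoded visual features, not SD-VLM's implementation.
    """

    def __init__(self, hidden_dim: int, num_freqs: int = 16, max_depth_m: float = 20.0):
        super().__init__()
        self.max_depth_m = max_depth_m
        # Sinusoidal frequencies over normalized depth, projected to the token width.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs).float())
        self.proj = nn.Linear(2 * num_freqs, hidden_dim)

    def forward(self, patch_tokens: torch.Tensor, patch_depth_m: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, hidden_dim); patch_depth_m: (B, N), in meters.
        d = (patch_depth_m / self.max_depth_m).clamp(0.0, 1.0).unsqueeze(-1)   # (B, N, 1)
        angles = d * self.freqs                                                # (B, N, num_freqs)
        depth_code = torch.cat([angles.sin(), angles.cos()], dim=-1)           # (B, N, 2*num_freqs)
        return patch_tokens + self.proj(depth_code)

# Usage (shapes only): DepthPositionalEncoding(4096)(vision_tokens, depth_per_patch)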


Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

Xu, Liuchang, Zhao, Shuo, Lin, Qingming, Chen, Luyao, Luo, Qianqian, Wu, Sensen, Ye, Xinyue, Feng, Hailin, Du, Zhenhong

arXiv.org Artificial Intelligence

The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study addresses that gap by introducing a novel multi-task spatial evaluation dataset, designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset encompasses twelve distinct task types, including spatial understanding and path planning, each with verified, accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo and gpt-4o and ZhipuAI's glm-4, through a two-phase testing approach: first zero-shot testing, then prompt-tuning tests on the dataset categorized by difficulty. Results indicate that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it surpassed gpt-4o on place-name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For example, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in path planning from 12.4% to 87.5%, while a one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.
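
As a rough illustration of the two-phase protocol, the sketch below scores a model on the same questions under a zero-shot prompt and under a chain-of-thought prompt, then compares accuracies against the verified answers. The query_model callable and the substring-match scoring are assumptions standing in for whatever chat API and answer checker the study actually uses.

from typing import Callable

ZERO_SHOT = "Answer the following spatial question with the final answer only.\n{q}"
COT = ("Answer the following spatial question. Think step by step, "
       "then give the final answer on the last line.\n{q}")

def accuracy(query_model: Callable[[str], str],
             dataset: list[dict],   # each item: {"question": ..., "answer": ...}
             template: str) -> float:
    """Score one model on one prompt template against verified answers."""
    correct = 0
    for item in dataset:
        reply = query_model(template.format(q=item["question"]))
        # Crude check: the verified answer string appears in the model's reply.
        if item["answer"].strip().lower() in reply.strip().lower():
            correct += 1
    return correct / max(len(dataset), 1)

# Phase comparison for one task category, e.g. path planning:
# zero_shot_acc = accuracy(query_model, path_planning_items, ZERO_SHOT)
# cot_acc       = accuracy(query_model, path_planning_items, COT)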


Referee-Meta-Learning for Fast Adaptation of Locational Fairness

Chen, Weiye, Xie, Yiqun, Jia, Xiaowei, He, Erhu, Bao, Han, An, Bang, Zhou, Xun

arXiv.org Artificial Intelligence

When dealing with data from distinct locations, machine learning algorithms tend to show an implicit preference for some locations over others, a bias that undermines the spatial fairness of the algorithm. Given the broad adoption of learning-based solutions in practice, this unfairness can easily propagate into subsequent decision-making. However, locational biases in AI are largely understudied. To mitigate biases over locations, we propose a locational meta-referee (Meta-Ref) to oversee the few-shot meta-training and meta-testing of a deep neural network. Meta-Ref dynamically adjusts the learning rates for training samples from given locations to promote fair performance across locations, through an explicit consideration of locational biases and the characteristics of the input data. We present a three-phase training framework to learn both a meta-learning-based predictor and an integrated Meta-Ref that governs the fairness of the model. Once trained on a distribution of spatial tasks, Meta-Ref is applied to samples from new spatial tasks (i.e., regions outside the training area) to promote fairness during the fine-tuning step. We carried out experiments with two case studies on crop monitoring and transportation safety, which show that Meta-Ref can improve locational fairness while keeping the overall prediction quality at a similar level.
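
The sketch below illustrates the core mechanism described here: a small referee network maps location features to learning-rate multipliers that weight each location's contribution to an inner-loop update. The MAML-style inner step, the network shape, and all names are assumptions; they do not reproduce the paper's three-phase training framework.

import torch
import torch.nn as nn

class Referee(nn.Module):
    """Hypothetical referee: location features -> positive learning-rate multiplier."""

    def __init__(self, loc_feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(loc_feat_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, loc_feats: torch.Tensor) -> torch.Tensor:
        # Softplus keeps the per-location multipliers positive.
        return nn.functional.softplus(self.net(loc_feats)).squeeze(-1)

def inner_step(predictor: nn.Module, referee: Referee, base_lr: float,
               batches_by_location: dict) -> None:
    """One fairness-aware inner update over the locations of a spatial task.

    batches_by_location maps a location id to (loc_feats, (x, y)).
    """
    losses = []
    for loc_feats, (x, y) in batches_by_location.values():
        lr_scale = referee(loc_feats).mean()          # how strongly to weight this location
        losses.append(lr_scale * nn.functional.mse_loss(predictor(x), y))
    loss = torch.stack(losses).sum()
    grads = torch.autograd.grad(loss, list(predictor.parameters()))
    with torch.no_grad():
        for p, g in zip(predictor.parameters(), grads):
            p -= base_lr * g                          # update scaled by the referee's weights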


Why It's Notoriously Difficult to Compare AI and Human Perception

#artificialintelligence

Science fiction is becoming reality as increasingly intelligent machines gradually emerge -- ones that not only specialize in things like chess, but can also carry out higher-level reasoning, or even answer deep philosophical questions. For the past few decades, experts have been working toward the creation of such a human-like artificial intelligence, a so-called "strong" AI or artificial general intelligence (AGI), which can learn to perform a wide range of tasks as easily as a human might. But while current AI development may take some inspiration from the neuroscience of the human brain, is it actually appropriate to compare the way AI processes information with the way humans do? The answer depends on how experiments are set up and how AI models are structured and trained, according to new research from a team at the University of Tübingen and other German research institutes. The team's study suggests that, because AI and humans arrive at perceptual decisions in different ways, generalizations drawn from such comparisons may not be fully reliable, especially if machines are used to automate critical tasks.

