Feng, Xiaokun
How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking
Li, Xuchen, Hu, Shiyu, Feng, Xiaokun, Zhang, Dailing, Wu, Meiqi, Zhang, Jing, Huang, Kaiqi
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at \url{http://metaverse.aitestunion.com}.
DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM
Li, Xuchen, Hu, Shiyu, Feng, Xiaokun, Zhang, Dailing, Wu, Meiqi, Zhang, Jing, Huang, Kaiqi
Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on http://videocube.aitestunion.com/.
Split Semantic Detection in Sandplay Images
Feng, Xiaokun, Chen, Xiaotang, Jia, Jian, Huang, Kaiqi
Sandplay image, as an important psychoanalysis carrier, is a visual scene constructed by the client selecting and placing sand objects (e.g., sand, river, human figures, animals, vegetation, buildings, etc.). As the projection of the client's inner world, it contains high-level semantic information reflecting the client's subjective psychological states, which is different from the common natural image scene that only contains the objective basic semantics (e.g., object's name, attribute, bounding box, etc.). In this work, we take "split" which is a typical psychological semantics related to many emotional and personality problems as the research goal, and we propose an automatic detection model, which can replace the time-consuming and expensive manual analysis process. To achieve that, we design a distribution map generation method projecting the semantic judgment problem into a visual problem, and a feature dimensionality reduction and extraction algorithm which can provide a good representation of split semantics. Besides, we built a sandplay datasets by collecting one sample from each client and inviting 5 therapists to label each sample, which has a large data cost. Experimental results demonstrated the effectiveness of our proposed method.
See Your Heart: Psychological states Interpretation through Visual Creations
Yang, Likun, Feng, Xiaokun, Chen, Xiaotang, Zhang, Shiyu, Huang, Kaiqi
In psychoanalysis, generating interpretations to one's psychological state through visual creations is facing significant demands. The two main tasks of existing studies in the field of computer vision, sentiment/emotion classification and affective captioning, can hardly satisfy the requirement of psychological interpreting. To meet the demands for psychoanalysis, we introduce a challenging task, \textbf{V}isual \textbf{E}motion \textbf{I}nterpretation \textbf{T}ask (VEIT). VEIT requires AI to generate reasonable interpretations of creator's psychological state through visual creations. To support the task, we present a multimodal dataset termed SpyIn (\textbf{S}and\textbf{p}la\textbf{y} \textbf{In}terpretation Dataset), which is psychological theory supported and professional annotated. Dataset analysis illustrates that SpyIn is not only able to support VEIT, but also more challenging compared with other captioning datasets. Building on SpyIn, we conduct experiments of several image captioning method, and propose a visual-semantic combined model which obtains a SOTA result on SpyIn. The results indicate that VEIT is a more challenging task requiring scene graph information and psychological knowledge. Our work also show a promise for AI to analyze and explain inner world of humanity through visual creations.