AITopics | Feng, Xiaokun

How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking

Li, Xuchen, Hu, Shiyu, Feng, Xiaokun, Zhang, Dailing, Wu, Meiqi, Zhang, Jing, Huang, Kaiqi

arXiv.org Artificial Intelligence, Nov-23-2024

Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information, providing semantic guidance to enhance tracking performance under challenging conditions like fast motion and deformations. However, current VLT trackers often underperform compared to single-modality methods on multiple benchmarks, with semantic information sometimes becoming a "distraction." To address this, we propose VLTVerse, the first fine-grained evaluation framework for VLT trackers that comprehensively considers multiple challenge factors and diverse semantic information, hoping to reveal the role of language in VLT. Our contributions include: (1) VLTVerse introduces 10 sequence-level challenge labels and 6 types of multi-granularity semantic information, creating a flexible and multi-dimensional evaluation space for VLT; (2) leveraging 60 subspaces formed by combinations of challenge factors and semantic types, we conduct systematic fine-grained evaluations of three mainstream SOTA VLT trackers, uncovering their performance bottlenecks across complex scenarios and offering a novel perspective on VLT evaluation; (3) through decoupled analysis of experimental results, we examine the impact of various semantic types on specific challenge factors in relation to different algorithms, providing essential guidance for enhancing VLT across data, evaluation, and algorithmic dimensions. The VLTVerse, toolkit, and results will be available at http://metaverse.aitestunion.com.
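
The 60 subspaces mentioned in the abstract follow from pairing each of the 10 challenge labels with each of the 6 semantic types. A minimal sketch of that count is shown below; the label names are placeholders, since the abstract gives only the counts and not the actual labels, and this is not the VLTVerse toolkit itself.

```python
from itertools import product

# Placeholder names (assumption): the abstract states only the counts,
# 10 challenge labels and 6 semantic types, not what they are called.
challenge_factors = [f"challenge_{i}" for i in range(1, 11)]
semantic_types = [f"semantic_{j}" for j in range(1, 7)]

# Each evaluation subspace pairs one challenge factor with one semantic type.
subspaces = list(product(challenge_factors, semantic_types))

print(len(subspaces))  # 10 * 6 = 60
```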

natural language, text processing, vision-language tracking

arXiv.org Artificial Intelligence

2411.156

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)

DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM

Li, Xuchen, Hu, Shiyu, Feng, Xiaokun, Zhang, Dailing, Wu, Meiqi, Zhang, Jing, Huang, Kaiqi

arXiv.org Artificial Intelligence, Oct-9-2024

Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, (1) we propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks and covering three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer texts at four granularities in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance, and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results, and toolkit will be released gradually on http://videocube.aitestunion.com/.

benchmark, large language model, machine learning

arXiv.org Artificial Intelligence

2410.02492

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Split Semantic Detection in Sandplay Images

Feng, Xiaokun, Chen, Xiaotang, Jia, Jian, Huang, Kaiqi

arXiv.org Artificial Intelligence, Nov-8-2023

A sandplay image, an important carrier for psychoanalysis, is a visual scene constructed by the client selecting and placing sand objects (e.g., sand, river, human figures, animals, vegetation, buildings, etc.). As the projection of the client's inner world, it contains high-level semantic information reflecting the client's subjective psychological states, which is different from the common natural image scene that only contains objective basic semantics (e.g., an object's name, attributes, bounding box, etc.). In this work, we take "split," a typical psychological semantic related to many emotional and personality problems, as the research target, and we propose an automatic detection model that can replace the time-consuming and expensive manual analysis process. To achieve that, we design a distribution map generation method that projects the semantic judgment problem into a visual problem, and a feature dimensionality reduction and extraction algorithm that can provide a good representation of split semantics. In addition, we built a sandplay dataset by collecting one sample from each client and inviting 5 therapists to label each sample, which incurs a large data cost. Experimental results demonstrated the effectiveness of our proposed method.

artificial intelligence, sandplay image, split semantic detection

arXiv.org Artificial Intelligence

2203.00907

Genre: Research Report (0.66)

Industry: Health & Medicine (0.53)

Technology: Information Technology > Artificial Intelligence (0.73)

See Your Heart: Psychological states Interpretation through Visual Creations

Yang, Likun, Feng, Xiaokun, Chen, Xiaotang, Zhang, Shiyu, Huang, Kaiqi

arXiv.org Artificial Intelligence, Mar-16-2023

In psychoanalysis, there is significant demand for generating interpretations of a person's psychological state from their visual creations. The two main tasks of existing studies in the field of computer vision, sentiment/emotion classification and affective captioning, can hardly satisfy the requirements of psychological interpretation. To meet the demands of psychoanalysis, we introduce a challenging task, the Visual Emotion Interpretation Task (VEIT). VEIT requires AI to generate reasonable interpretations of a creator's psychological state through visual creations. To support the task, we present a multimodal dataset termed SpyIn (Sandplay Interpretation Dataset), which is supported by psychological theory and professionally annotated. Dataset analysis illustrates that SpyIn is not only able to support VEIT, but is also more challenging than other captioning datasets. Building on SpyIn, we conduct experiments with several image captioning methods, and propose a visual-semantic combined model which obtains a SOTA result on SpyIn. The results indicate that VEIT is a more challenging task requiring scene graph information and psychological knowledge. Our work also shows promise for AI to analyze and explain the inner world of humanity through visual creations.

artificial intelligence, psychological state interpretation, visual creation

arXiv.org Artificial Intelligence

2302.10276

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Vision (0.53)
