AITopics | Li, Xiangtai

Collaborating Authors

Li, Xiangtai

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

UMC: Unified Resilient Controller for Legged Robots with Joint Malfunctions

Qiu, Yu, Lin, Xin, Wang, Jingbo, Li, Xiangtai, Qi, Lu, Yang, Ming-Hsuan

arXiv.org Artificial IntelligenceFeb-5-2025

Adaptation to unpredictable damages is crucial for autonomous legged robots, yet existing methods based on multi-policy or meta-learning frameworks face challenges like limited generalization and complex maintenance. To address this issue, we first analyze and summarize eight types of damage scenarios, including sensor failures and joint malfunctions. Then, we propose a novel, model-free, two-stage training framework, Unified Malfunction Controller (UMC), incorporating a masking mechanism to enhance damage resilience. Specifically, the model is initially trained with normal environments to ensure robust performance under standard conditions. In the second stage, we use masks to prevent the legged robot from relying on malfunctioning limbs, enabling adaptive gait and movement adjustments upon malfunction. Experimental results demonstrate that our approach improves the task completion capability by an average of 36% for the transformer and 39% for the MLP across three locomotion tasks. The source code and trained models will be made available to the public.

artificial intelligence, machine learning, umc 0, (13 more...)

arXiv.org Artificial Intelligence

2502.03035

Country:

North America > United States > California (0.14)
Asia > China (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Robots > Locomotion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Huang, Zhenglin, Hu, Jinwei, Li, Xiangtai, He, Yiwei, Zhao, Xingyu, Peng, Bei, Wu, Baoyuan, Huang, Xiaowei, Cheng, Guangliang

arXiv.org Artificial IntelligenceDec-5-2024

The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diversified deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Compared with state-of-the-art deepfake detection models on SID-Set and other benchmarks, extensive experiments demonstrate that SIDA achieves superior performance among diversified settings. The code, model, and dataset will be released.

detection, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2412.04292

Country: Asia > China > Guangdong Province (0.14)

Genre: Research Report (0.50)

Industry:

Transportation > Passenger (1.00)
Information Technology > Security & Privacy (1.00)
Leisure & Entertainment > Sports (0.93)
Transportation > Ground > Road (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(2 more...)

Add feedback

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Zhao, Yu, Fei, Hao, Li, Xiangtai, Qin, Libo, Ji, Jiayi, Zhu, Hongyuan, Zhang, Meishan, Zhang, Min, Wei, Jianguo

arXiv.org Artificial IntelligenceOct-20-2024

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D$\to$image and 3D$\to$text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^3$) framework, which utilizes the intermediate features of the 3D$\to$X processes to guide the hard X$\to$3D processes, such that the overall ST2I and SI2T will benefit each other. On the visual spatial understanding dataset VSD, our system outperforms the mainstream T2I and I2T methods significantly. Further in-depth analysis reveals how our dual learning strategy advances.

large language model, machine learning, natural language, (14 more...)

arXiv.org Artificial Intelligence

2410.15312

Country:

North America > United States (1.00)
Europe (1.00)
Asia > China (0.67)

Genre: Research Report (0.82)

Industry: Transportation > Ground > Rail (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Spatial Reasoning (1.00)
(2 more...)

Add feedback

4D Panoptic Scene Graph Generation

Yang, Jingkang, Cen, Jun, Peng, Wenxuan, Liu, Shuai, Hong, Fangzhou, Li, Xiangtai, Zhou, Kaiyang, Chen, Qifeng, Liu, Ziwei

arXiv.org Artificial IntelligenceMay-16-2024

We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2405.10305

Country:

Asia > China (0.28)
Asia > Middle East > Israel (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

VG4D: Vision-Language Model Goes 4D Video Recognition

Deng, Zhichao, Li, Xiangtai, Li, Xia, Tong, Yunhai, Zhao, Shen, Liu, Mengyuan

arXiv.org Artificial IntelligenceApr-17-2024

Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach involves aligning the 4D encoder's representation with a VLM to learn a shared visual and text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 dataset and the NTU RGB+D 120 dataset. Code is available at \url{https://github.com/Shark0-0/VG4D}.

artificial intelligence, point cloud, recognition, (16 more...)

arXiv.org Artificial Intelligence

2404.11605

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Industry: Information Technology (0.34)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Add feedback

Generalizable Entity Grounding via Assistance of Large Language Model

Qi, Lu, Chen, Yi-Wen, Yang, Lehan, Shen, Tiancheng, Li, Xiangtai, Guo, Weidong, Xu, Yu, Yang, Ming-Hsuan

arXiv.org Artificial IntelligenceFeb-4-2024

In this work, we propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model (LMM) to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and the proposed multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask. Additionally, we introduce a strategy of encoding entity segmentation masks into a colormap, enabling the preservation of fine-grained predictions from features of high-resolution masks. This approach allows us to extract visual features from low-resolution images using the CLIP vision encoder in the LMM, which is more computationally efficient than existing approaches that use an additional encoder for high-resolution images. Our comprehensive experiments demonstrate the superiority of our method, outperforming state-of-the-art techniques on three tasks, including panoptic narrative grounding, referring expression segmentation, and panoptic segmentation.

large language model, natural language, segmentation, (15 more...)

arXiv.org Artificial Intelligence

2402.02555

Country: North America > United States > California (0.14)

Genre: Research Report > Promising Solution (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Panoptic Video Scene Graph Generation

Yang, Jingkang, Peng, Wenxuan, Li, Xiangtai, Guo, Zujin, Chen, Liangyu, Li, Bo, Ma, Zheng, Zhou, Kaiyang, Zhang, Wayne, Loy, Chen Change, Liu, Ziwei

arXiv.org Artificial IntelligenceNov-28-2023

Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic scene graph generation (PVSG). PVSG relates to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG to miss key details crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2311.17058

Country: Asia > China (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

Xie, Jiahao, Li, Wei, Li, Xiangtai, Liu, Ziwei, Ong, Yew Soon, Loy, Chen Change

arXiv.org Artificial IntelligenceSep-22-2023

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code will be released at https://github.com/Jiahao000/MosaicFusion.

artificial intelligence, machine learning, mosaicfusion, (16 more...)

arXiv.org Artificial Intelligence

2309.13042

Country: Asia (0.14)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Neural Collapse Terminus: A Unified Solution for Class Incremental Learning and Its Variants

Yang, Yibo, Yuan, Haobo, Li, Xiangtai, Wu, Jianlong, Zhang, Lefei, Lin, Zhouchen, Torr, Philip, Tao, Dacheng, Ghanem, Bernard

arXiv.org Artificial IntelligenceAug-3-2023

How to enable learnability for new classes while keeping the capability well on old classes has been a crucial challenge for class incremental learning. Beyond the normal case, long-tail class incremental learning and few-shot class incremental learning are also proposed to consider the data imbalance and data scarcity, respectively, which are common in real-world implementations and further exacerbate the well-known problem of catastrophic forgetting. Existing methods are specifically proposed for one of the three tasks. In this paper, we offer a unified solution to the misalignment dilemma in the three tasks. Concretely, we propose neural collapse terminus that is a fixed structure with the maximal equiangular inter-class separation for the whole label space. It serves as a consistent target throughout the incremental training to avoid dividing the feature space incrementally. For CIL and LTCIL, we further propose a prototype evolving scheme to drive the backbone features into our neural collapse terminus smoothly. Our method also works for FSCIL with only minor adaptations. Theoretical analysis indicates that our method holds the neural collapse optimality in an incremental fashion regardless of data imbalance or data scarcity. We also design a generalized case where we do not know the total number of classes and whether the data distribution is normal, long-tail, or few-shot for each coming session, to test the generalizability of our method. Extensive experiments with multiple datasets are conducted to demonstrate the effectiveness of our unified solution to all the three tasks and the generalized case.

artificial intelligence, learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2308.01746

Country:

Asia > China (0.67)
Asia > Middle East > Saudi Arabia (0.14)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

Pair then Relation: Pair-Net for Panoptic Scene Graph Generation

Wang, Jinghao, Wen, Zhengyu, Li, Xiangtai, Guo, Zujin, Yang, Jingkang, Liu, Ziwei

arXiv.org Artificial IntelligenceAug-1-2023

Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generation (SGG) that aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes. Compared to SGG, PSG has several challenging problems: pixel-level segment outputs and full relationship exploration (It also considers thing and stuff relation). Thus, current PSG methods have limited performance, which hinders downstream tasks or applications. The goal of this work aims to design a novel and strong baseline for PSG. To achieve that, we first conduct an in-depth analysis to identify the bottleneck of the current PSG models, finding that inter-object pair-wise recall is a crucial factor that was ignored by previous PSG methods. Based on this and the recent query-based frameworks, we present a novel framework: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pair-wise relationships between subjects and objects. Moreover, we also observed the sparse nature of object pairs for both Motivated by this, we design a lightweight Matrix Learner within the PPN, which directly learn pair-wised relationships for pair proposal generation. Through extensive ablation and analysis, our approach significantly improves upon leveraging the segmenter solid baseline. Notably, our method achieves new state-of-the-art results on the PSG benchmark, with over 10\% absolute gains compared to PSGFormer. The code of this paper is publicly available at https://github.com/king159/Pair-Net.

machine learning, natural language, relation, (15 more...)

arXiv.org Artificial Intelligence

2307.08699

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback