Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation
Ji, Haoyu, Chen, Bowen, Ren, Weihong, Huang, Wenze, Yang, Zhihao, Wang, Zhiyong, Liu, Honghai
Abstract—Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, and Honghai Liu are with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China (e-mail: jihaoyu1224@gmail.com, …). The code is available at https://github.com/HaoyuJi/TRG-Net.

Fig. 1. The text embeddings and relational graphs generated by large language models can serve as priors for enhancing the modeling and supervision of action segmentation. Specifically, the text-derived joint graph effectively captures spatial correlations, while the text-derived action graph and action embeddings supervise the relationships and distributions of action classes.

I. INTRODUCTION

Temporal Action Segmentation (TAS), an advanced task in video understanding, aims to segment and recognize each action within long, untrimmed video sequences of human activities [1]. Similar to how semantic segmentation predicts labels for each pixel in an image, TAS predicts action labels for each frame in a video. As a significant task in computer vision, TAS finds applications in various domains such as medical rehabilitation [2], industrial monitoring [3], and activity analysis [4]. Existing TAS methods can be broadly categorized into two types based on input modality: Video-based TAS (VTAS) and Skeleton-based TAS (STAS) [5]-[7].
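To make the DSFM idea concrete, the following PyTorch sketch shows one plausible form of TJG-based spatial modeling with channel- and frame-level dynamic adaptation. It is a minimal illustration under assumed shapes and names (`DynamicTJGConv`, `chan_adapt`, and `frame_adapt` are ours, not the paper's), not the authors' implementation.

```python
import torch
import torch.nn as nn


class DynamicTJGConv(nn.Module):
    """Sketch of a graph conv over joints whose adjacency is the TJG
    prior plus channel-level and frame-level learned offsets (assumed
    form, not the authors' code)."""

    def __init__(self, channels, num_joints, tjg):
        super().__init__()
        # Fixed prior: text-derived joint graph, shape (V, V).
        self.register_buffer('tjg', tjg)
        # Channel-level adaptation: one learnable graph offset per channel.
        self.chan_adapt = nn.Parameter(
            torch.zeros(channels, num_joints, num_joints))
        # Frame-level adaptation: a graph offset predicted from the input.
        self.frame_adapt = nn.Conv2d(channels, num_joints * num_joints, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                  # x: (N, C, T, V)
        N, C, T, V = x.shape
        frame = self.frame_adapt(x).mean(-1)               # (N, V*V, T)
        frame = frame.permute(0, 2, 1).reshape(N, T, V, V)
        # Combine prior with channel- and frame-level adaptations.
        adj = (self.tjg
               + self.chan_adapt[None, :, None]            # per channel
               + frame[:, None])                           # per frame
        adj = adj.softmax(-1)                              # row-normalize
        out = torch.einsum('nctv,nctvw->nctw', x, adj)     # aggregate joints
        return self.proj(out)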
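The two ARIS terms can likewise be illustrated with a short sketch: a contrastive (CLIP-style) loss aligns each feature with its class text embedding (absolute), and a KL term pulls the feature-to-class similarity distribution toward the corresponding TAG row (relative). The function name `aris_loss`, the temperature `tau`, and the weight `beta` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def aris_loss(feats, text_emb, labels, tag, tau=0.07, beta=0.5):
    """feats:    (N, D) frame-level action features
    text_emb: (C, D) LLM text embeddings, one per action class
    labels:   (N,)   ground-truth class indices
    tag:      (C, C) text-derived action graph (class similarities)
    """
    feats = F.normalize(feats, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Absolute term: contrastive alignment of each feature with the
    # text embedding of its ground-truth class.
    logits = feats @ text_emb.t() / tau                    # (N, C)
    abs_loss = F.cross_entropy(logits, labels)

    # Relative term: the similarity distribution over all classes
    # should follow the TAG row of the true class.
    target = F.softmax(tag[labels] / tau, dim=-1)          # (N, C)
    rel_loss = F.kl_div(F.log_softmax(logits, dim=-1),
                        target, reduction='batchmean')

    return abs_loss + beta * rel_loss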
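Finally, the SAEP augmentations are simple enough to sketch directly. The snippet below occludes a random subset of joints for the whole sequence and rotates all coordinates about the vertical axis; the ratio, angle range, and per-sequence (rather than per-frame) occlusion are illustrative assumptions, not the paper's exact settings.

```python
import math
import torch


def saep_augment(skel, occlude_ratio=0.1, max_angle=0.3):
    """skel: (T, V, 3) skeleton sequence; T frames, V joints, xyz coords."""
    T, V, C = skel.shape
    skel = skel.clone()

    # Random joint occlusion: zero out a random subset of joints to
    # encourage spatially robust features.
    num_occ = max(1, int(V * occlude_ratio))
    occ = torch.randperm(V)[:num_occ]
    skel[:, occ, :] = 0.0

    # Axial rotation: rotate all joints about the vertical (y) axis
    # by a random angle in [-max_angle, max_angle] radians.
    theta = float((torch.rand(()) * 2 - 1) * max_angle)
    cos, sin = math.cos(theta), math.sin(theta)
    rot = skel.new_tensor([[cos, 0.0, sin],
                           [0.0, 1.0, 0.0],
                           [-sin, 0.0, cos]])
    return skel @ rot.t()
```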