
Collaborating Authors

 Liu, Honghai


Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

arXiv.org Artificial Intelligence

Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLMs) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

Temporal Action Segmentation (TAS), an advanced task in video understanding, aims to segment and recognize each action within long, untrimmed video sequences of human activities [1]. Similar to how semantic segmentation predicts labels for each pixel in an image, TAS predicts action labels for each frame in a video. As a significant task in computer vision, TAS finds applications in various domains such as medical rehabilitation [2], industrial monitoring [3], and activity analysis [4]. Existing TAS methods can be broadly categorized into two types based on input modality: Video-based TAS (VTAS) and Skeleton-based TAS (STAS) [5]-[7].

Figure caption: The text embeddings and relational graphs generated by large language models can serve as priors for enhancing the modeling and supervision of action segmentation. Specifically, the text-derived joint graph effectively captures spatial correlations, while the text-derived action graph and action embeddings supervise the relationships and distributions of action classes.

Footnote: Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, and Honghai Liu are with the State Key Laboratory of Robotics and Systems, Harbin Institute of Technology Shenzhen, Shenzhen 518055, China (e-mail: jihaoyu1224@gmail.com, …). The code is available at https://github.com/HaoyuJi/TRG-Net.
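As an illustration of the SAEP processing described above, the following is a minimal sketch of random joint occlusion and axial rotation for a skeleton sequence, assuming input of shape (3, T, V) (xyz channels, frames, joints). The function names, occlusion probability, and rotation range are illustrative assumptions, not the paper's implementation.

    import math
    import torch

    def random_joint_occlusion(x: torch.Tensor, p: float = 0.1) -> torch.Tensor:
        # Zero out a random subset of joints for the whole sequence.
        C, T, V = x.shape
        keep = (torch.rand(V) > p).float()      # 1 = keep joint, 0 = occlude
        return x * keep.view(1, 1, V)

    def random_axial_rotation(x: torch.Tensor, max_deg: float = 30.0) -> torch.Tensor:
        # Rotate all joints around the vertical (y) axis by a random angle.
        theta = math.radians((torch.rand(1).item() * 2 - 1) * max_deg)
        c, s = math.cos(theta), math.sin(theta)
        rot = torch.tensor([[c, 0.0, s],
                            [0.0, 1.0, 0.0],
                            [-s, 0.0, c]])
        return torch.einsum('ij,jtv->itv', rot, x)  # apply to the xyz channels

    # Usage: augment one 64-frame sequence over 25 joints.
    seq = torch.randn(3, 64, 25)
    seq = random_axial_rotation(random_joint_occlusion(seq))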


The Unified Balance Theory of Second-Moment Exponential Scaling Optimizers in Visual Tasks

arXiv.org Artificial Intelligence

Existing first-order optimizers fall into two main branches: classical optimizers represented by Stochastic Gradient Descent (SGD), and adaptive optimizers represented by Adam, along with their many derivatives. The debate over the merits and demerits of these two families has persisted for a decade. In practice, SGD is generally considered more suitable for tasks such as Computer Vision (CV), while adaptive optimizers are widely used in tasks with sparse gradients, such as Large Language Models (LLMs). Although adaptive optimizers almost always converge faster, they can overfit in some cases, yielding poorer generalization than SGD on certain tasks. Even in Large Language Models, Adam continues to face challenges, and its original strategy does not always retain an advantage once improvements such as gradient clipping are introduced. With such a wide variety of optimization methods available, a unified, interpretable theory is needed. Working within the framework of first-order optimizers and drawing on balance theory, this paper proposes, for the first time, a unified strategy that integrates all first-order optimization methods.
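One way to read the abstract's "second-moment exponential scaling" framing is as a single update rule in which an exponent p on the second moment interpolates between SGD with momentum (p = 0) and an Adam-style update (p = 0.5). The sketch below, with assumed names and hyperparameters and with bias correction omitted, is a speculative illustration of that reading, not the paper's actual formulation.

    import torch

    @torch.no_grad()
    def unified_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                     p=0.5, eps=1e-8):
        # p = 0.5 gives an Adam-style update (bias correction omitted here);
        # p = 0.0 reduces to SGD with EMA momentum.
        for w, g in zip(params, grads):
            st = state.setdefault(id(w), {"m": torch.zeros_like(w),
                                          "v": torch.zeros_like(w)})
            st["m"].mul_(beta1).add_(g, alpha=1 - beta1)         # first moment
            st["v"].mul_(beta2).addcmul_(g, g, value=1 - beta2)  # second moment
            denom = st["v"].pow(p).add_(eps) if p > 0 else 1.0
            w.sub_(lr * st["m"] / denom)

Intermediate exponents 0 < p < 0.5 would then correspond to points on a continuum between the two classical branches, which is one plausible sense in which a single strategy could cover all first-order methods.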


Asymmetric Momentum: A Rethinking of Gradient Descent

arXiv.org Artificial Intelligence

Through theoretical and experimental validation, and unlike existing adaptive methods such as Adam, which penalize frequently-changing parameters and are applicable only to sparse gradients, we propose the simplest SGD-based enhancement: Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the training process into distinct loss phases and apply a different momentum coefficient in each. The method can not only accelerate slowly-changing parameters for sparse gradients, as adaptive optimizers do, but can also choose to accelerate frequently-changing parameters for non-sparse gradients, making it adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity correlated with the specificity of the dataset. Interestingly, we observe that with non-sparse gradients, frequently-changing parameters should in fact be accelerated, which is the exact opposite of the traditional adaptive perspective. Compared with traditional SGD with momentum, the algorithm separates the weights without additional computational cost. Notably, the method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks (WRN) in our research, on the classic CIFAR-10 and CIFAR-100 datasets, to test the ability for feature separation, and we observe phenomena that matter more than accuracy rates alone. Finally, compared with classic SGD tuning methods on these two datasets, WRN trained with LCAM achieves equal or better test accuracy in roughly half the training epochs.
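A minimal sketch of the loss-phase switching idea behind LCAM follows: the running average of the loss splits training into phases, and each phase uses a different momentum coefficient. The class name, the two coefficients, and the above/below-average switching rule are illustrative guesses, not the authors' exact algorithm.

    import torch

    class LossControlledMomentum:
        def __init__(self, params, lr=0.1, beta_low=0.85, beta_high=0.95):
            self.params = list(params)
            self.lr, self.beta_low, self.beta_high = lr, beta_low, beta_high
            self.buf = [torch.zeros_like(w) for w in self.params]
            self.loss_sum, self.steps = 0.0, 0

        @torch.no_grad()
        def step(self, loss_value: float):
            # Average the loss to date, then pick the momentum phase.
            self.loss_sum += loss_value
            self.steps += 1
            mean_loss = self.loss_sum / self.steps
            beta = self.beta_high if loss_value > mean_loss else self.beta_low
            for w, b in zip(self.params, self.buf):
                if w.grad is None:
                    continue
                b.mul_(beta).add_(w.grad)   # heavy-ball momentum buffer
                w.sub_(self.lr * b)

In this reading, the "asymmetry" lies in applying different accelerations in different loss phases rather than one global momentum, at the same per-step cost as standard SGD with momentum.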