Song, Kun
Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update
Li, Qing, Geng, Jiahui, Chen, Zongxiong, Song, Kun, Ma, Lei, Karray, Fakhri
Vision-language models (VLMs) demonstrate strong multimodal capabilities but have been found to be more susceptible to generating harmful content compared to their backbone large language models (LLMs). Our investigation reveals that the integration of images significantly shifts the model's internal activations during the forward pass, diverging from those triggered by textual input. Moreover, the safety alignments of LLMs embedded within VLMs are not sufficiently robust to handle the activations discrepancies, making the models vulnerable to even the simplest jailbreaking attacks. To address this issue, we propose an \textbf{internal activation revision} approach that efficiently revises activations during generation, steering the model toward safer outputs. Our framework incorporates revisions at both the layer and head levels, offering control over the model's generation at varying levels of granularity. In addition, we explore three strategies for constructing positive and negative samples and two approaches for extracting revision vectors, resulting in different variants of our method. Comprehensive experiments demonstrate that the internal activation revision method significantly improves the safety of widely used VLMs, reducing attack success rates by an average of 48.94\%, 34.34\%, 43.92\%, and 52.98\% on SafeBench, Safe-Unsafe, Unsafe, and MM-SafetyBench, respectively, while minimally impacting model helpfulness.
Enhance Hyperbolic Representation Learning via Second-order Pooling
Song, Kun, Solozabal, Ruben, hao, Li, Ren, Lu, Abdar, Moloud, Li, Qing, Karray, Fakhri, Takac, Martin
Hyperbolic representation learning is well known for its ability to capture hierarchical information. However, the distance between samples from different levels of hierarchical classes can be required large. We reveal that the hyperbolic discriminant objective forces the backbone to capture this hierarchical information, which may inevitably increase the Lipschitz constant of the backbone. This can hinder the full utilization of the backbone's generalization ability. To address this issue, we introduce second-order pooling into hyperbolic representation learning, as it naturally increases the distance between samples without compromising the generalization ability of the input features. In this way, the Lipschitz constant of the backbone does not necessarily need to be large. However, current off-the-shelf low-dimensional bilinear pooling methods cannot be directly employed in hyperbolic representation learning because they inevitably reduce the distance expansion capability. To solve this problem, we propose a kernel approximation regularization, which enables the low-dimensional bilinear features to approximate the kernel function well in low-dimensional space. Finally, we conduct extensive experiments on graph-structured datasets to demonstrate the effectiveness of the proposed method.
Exploring Information-Theoretic Metrics Associated with Neural Collapse in Supervised Training
Song, Kun, Tan, Zhiquan, Zou, Bochao, Chen, Jiansheng, Ma, Huimin, Huang, Weiran
In this paper, we utilize information-theoretic metrics like matrix entropy and mutual information to analyze supervised learning. We explore the information content of data representations and classification head weights and their information interplay during supervised training. Experiments show that matrix entropy cannot solely describe the interaction of the information content of data representation and classification head weights but it can effectively reflect the similarity and clustering behavior of the data. Inspired by this, we propose a cross-modal alignment loss to improve the alignment between the representations of the same class from different modalities. Moreover, in order to assess the interaction of the information content of data representation and classification head weights more accurately, we utilize new metrics like matrix mutual information ratio (MIR) and matrix information entropy difference ratio (HDR). Through theory and experiment, we show that HDR and MIR can not only effectively describe the information interplay of supervised training but also improve the performance of supervised and semi-supervised learning.
Distributed Motion Control of Multiple Mobile Manipulator System with Disturbance and Communication Delay
Liu, Wenhang, Ren, Meng, Song, Kun, Wang, Michael Yu, Xiong, Zhenhua
In real-world object manipulation scenarios, multiple mobile manipulator systems may suffer from disturbances and asynchrony, leading to excessive interaction forces and causing object damage or emergency stops. This paper presents a novel distributed motion control approach aimed at reducing these unnecessary interaction forces. The control strategy only utilizes force information without the need for global position and velocity information. Disturbances are corrected through compensatory movements of the manipulators. Besides, the asymmetric, non-uniform, and time-varying communication delays between robots are also considered. The stability of the control law is rigorously proven by the Lyapunov theorem. Subsequently, the efficacy of the proposed control law is validated through simulations and experiments of collaborative object transportation by two robots. Experimental results demonstrate the effectiveness of the proposed control law in reducing interaction forces during object manipulation.
Unveiling the Dynamics of Information Interplay in Supervised Learning
Song, Kun, Tan, Zhiquan, Zou, Bochao, Ma, Huimin, Huang, Weiran
In this paper, we use matrix information theory as an analytical tool to analyze the dynamics of the information interplay between data representations and classification head vectors in the supervised learning process. Specifically, inspired by the theory of Neural Collapse, we introduce matrix mutual information ratio (MIR) and matrix entropy difference ratio (HDR) to assess the interactions of data representation and class classification heads in supervised learning, and we determine the theoretical optimal values for MIR and HDR when Neural Collapse happens. Our experiments show that MIR and HDR can effectively explain many phenomena occurring in neural networks, for example, the standard supervised training dynamics, linear mode connectivity, and the performance of label smoothing and pruning. Additionally, we use MIR and HDR to gain insights into the dynamics of grokking, which is an intriguing phenomenon observed in supervised training, where the model demonstrates generalization capabilities long after it has learned to fit the training data. Furthermore, we introduce MIR and HDR as loss terms in supervised and semi-supervised learning to optimize the information interactions among samples and classification heads. The empirical results provide evidence of the method's effectiveness, demonstrating that the utilization of MIR and HDR not only aids in comprehending the dynamics throughout the training process but can also enhances the training procedure itself.
RHAML: Rendezvous-based Hierarchical Architecture for Mutual Localization
Chen, Gaoming, Song, Kun, Xu, Xiang, Liu, Wenhang, Xiong, Zhenhua
Mutual localization serves as the foundation for collaborative perception and task assignment in multi-robot systems. Effectively utilizing limited onboard sensors for mutual localization between marker-less robots is a worthwhile goal. However, due to inadequate consideration of large scale variations of the observed robot and localization refinement, previous work has shown limited accuracy when robots are equipped only with RGB cameras. To enhance the precision of localization, this paper proposes a novel rendezvous-based hierarchical architecture for mutual localization (RHAML). Firstly, to learn multi-scale robot features, anisotropic convolutions are introduced into the network, yielding initial localization results. Then, the iterative refinement module with rendering is employed to adjust the observed robot poses. Finally, the pose graph is conducted to globally optimize all localization results, which takes into account multi-frame observations. Therefore, a flexible architecture is provided that allows for the selection of appropriate modules based on requirements. Simulations demonstrate that RHAML effectively addresses the problem of multi-robot mutual localization, achieving translation errors below 2 cm and rotation errors below 0.5 degrees when robots exhibit 5 m of depth variation. Moreover, its practical utility is validated by applying it to map fusion when multi-robots explore unknown environments.
Multi-Robot Rendezvous in Unknown Environment with Limited Communication
Song, Kun, Chen, Gaoming, Liu, Wenhang, Xiong, Zhenhua
Rendezvous aims at gathering all robots at a specific location, which is an important collaborative behavior for multirobot systems. However, in an unknown environment, it is challenging to achieve rendezvous. Previous researches mainly focus on special scenarios where communication is not allowed and each robot executes a random searching strategy, which is highly time-consuming, especially in large-scale environments. In this work, we focus on rendezvous in unknown environments where communication is available. We divide this task into two steps: rendezvous based environment exploration with relative pose (RP) estimation and rendezvous point election. A new strategy called partitioned and incomplete exploration for rendezvous (PIER) is proposed to efficiently explore the unknown environment, where lightweight topological maps are constructed and shared among robots for RP estimation with very few communications. Then, a rendezvous point selection algorithm based on the merged topological map is proposed for efficient rendezvous for multi-robot systems. The effectiveness of the proposed methods is validated in both simulations and real-world experiments.
Motion Planning for Multiple Mobile Manipulator System in Complex Flipping Manipulation
Liu, Wenhang, Song, Kun, Ren, Meng, Hu, Jiawei, Wang, Michael Yu, Xiong, Zhenhua
Multiple robot systems are favored for object manipulation and transportation, especially for large objects. However, in more complex manipulation such as flipping, these systems encounter a new challenge, configuration disconnectivity of manipulators. Grasping objects by manipulators will impose closed-chain constraints on the system, which in turn limits the feasible motions of manipulators and further compromises the configuration connectivity. Multiple mobile manipulator systems show much more flexibility in object manipulation with the mobility of the mobile platform and have the potential to address the above problem. In this paper, a novel planning framework is proposed for complex flipping manipulation by incorporating platform motions and regrasping. Firstly, two types of trajectories, mobile manipulator planning and regrasping planning, are classified and can be assigned different priorities for different tasks. Secondly, corresponding planning methods are designed for each type of trajectory. Specifically, in mobile manipulator planning, the configuration of the platform is determined through optimization to ensure connectivity when the manipulator approaches configuration boundaries. In regrasping planning, closed-chain constraints are temporarily disregarded and the manipulation capabilities are prioritized to facilitate subsequent planning. Finally, the structure of the overall planning framework is provided. Experimental results demonstrate that the proposed planner efficiently plans the motions of the system to accomplish flipping manipulation. Additionally, a comprehensive experiment emphasizes the significance of our planner in extending the capabilities of multiple mobile manipulator systems in complex tasks.
FHT-Map: Feature-based Hierarchical Topological Map for Relocalization and Path Planning
Song, Kun, Liu, Wenhang, Chen, Gaoming, Xu, Xiang, Xiong, Zhenhua
Topological maps are favorable for their small storage compared to geometric map. However, they are limited in relocalization and path planning capabilities. To solve this problem, a feature-based hierarchical topological map (FHT-Map) is proposed along with a real-time map construction algorithm for robot exploration. Specifically, the FHT-Map utilizes both RGB cameras and LiDAR information and consists of two types of nodes: main node and support node. Main nodes will store visual information compressed by convolutional neural network and local laser scan data to enhance subsequent relocalization capability. Support nodes retain a minimal amount of data to ensure storage efficiency while facilitating path planning. After map construction with robot exploration, the FHT-Map can be used by other robots for relocalization and path planning. Experiments are conducted in Gazebo simulator, and the results demonstrate that the proposed FHT-Map can effectively improve relocalization and path planning capability compared with other topological maps. Moreover, experiments on hierarchical architecture are implemented to show the necessity of two types of nodes.
Parameter Free Large Margin Nearest Neighbor for Distance Metric Learning
Song, Kun (Northwestern Polytechnical University) | Nie, Feiping (Northwestern Polytechnical University) | Han, Junwei (Northwestern Polytechnical University) | Li, Xuelong (Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences)
We introduce a novel supervised metric learning algorithm named parameter free large margin nearest neighbor (PFLMNN) which can be seen as an improvement of the classical large margin nearest neighbor (LMNN) algorithm. The contributions of our work consist of two aspects. First, our method discards the costterm which shrinks the distances between inquiry input and its k target neighbors (the k nearest neighbors with same labels as inquiry input) in LMNN, and only focuses on improving the action to push the imposters (the samples with different labels form the inquiry input) apart out of the neighborhood of inquiry. As a result, our method does not have the parameter needed to tune on the validating set, which makes it more convenient to use. Second, by leveraging the geometry information of the imposters, we construct a novel cost function to penalize the smalldistances between each inquiry and its imposters. Different from LMNN considering every imposter located in the neighborhood of each inquiry, our method only takes care of the nearest imposters. Because when the nearest imposter is pushed out of the neighborhood of its inquiry, other imposters would be all out. In this way, the constraints in our model are much less than that of LMNN, which makes our method much easier to find the optimal distance metric. Consequently, our method not only learns a better distance metric than LMNN, but also runs faster than LMNN. Extensive experiments on different data sets with various sizes and difficulties are conducted, and the results have shown that, compared with LMNN, PFLMNN achieves better classification results.