Collaborating Authors: Xia, Gui-Song


LiDAR-enhanced 3D Gaussian Splatting Mapping

arXiv.org Artificial Intelligence

This paper introduces LiGSM, a novel LiDAR-enhanced 3D Gaussian Splatting (3DGS) mapping framework that improves the accuracy and robustness of 3D scene mapping by integrating LiDAR data. LiGSM constructs a joint loss from images and LiDAR point clouds to estimate poses and optimize the sensor extrinsic parameters, enabling dynamic adaptation to variations in sensor alignment. Furthermore, it leverages LiDAR point clouds to initialize 3DGS, providing denser and more reliable starting points than sparse SfM points. For scene rendering, the framework augments standard image-based supervision with depth maps generated from LiDAR projections, ensuring an accurate scene representation in both geometry and photometry. Experiments on public and self-collected datasets demonstrate that LiGSM outperforms comparative methods in pose tracking and scene rendering.
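
As a rough illustration of the joint image-and-depth supervision described above (this is not the paper's released code; the L1 terms and the `lambda_depth` weight are assumptions), the rendered image can be compared against the captured frame while the rendered depth is compared against a depth map obtained by projecting LiDAR points into the camera:

```python
import torch
import torch.nn.functional as F

def joint_loss(rendered_rgb, gt_rgb, rendered_depth, lidar_depth, lambda_depth=0.2):
    """Combine image supervision with LiDAR-projected depth supervision.

    rendered_rgb, gt_rgb : (3, H, W) tensors in [0, 1]
    rendered_depth       : (H, W) depth rendered from the 3DGS map
    lidar_depth          : (H, W) depth from projecting LiDAR points;
                           zeros where no LiDAR return is available
    """
    photometric = F.l1_loss(rendered_rgb, gt_rgb)

    # Only supervise pixels that actually received a LiDAR measurement.
    valid = lidar_depth > 0
    if valid.any():
        depth_term = F.l1_loss(rendered_depth[valid], lidar_depth[valid])
    else:
        depth_term = rendered_rgb.new_tensor(0.0)

    return photometric + lambda_depth * depth_term
```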


UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation

arXiv.org Artificial Intelligence

Recently, text-to-image generation models have achieved remarkable advancements, particularly with diffusion models facilitating high-quality image synthesis from textual descriptions. However, these models often struggle with achieving precise control over pixel-level layouts, object appearances, and global styles when using text prompts alone. To mitigate this issue, previous works introduce conditional images as auxiliary inputs for image generation, enhancing control but typically necessitating specialized models tailored to different types of reference inputs. In this paper, we explore a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture, to enable flexible and controllable generation across diverse conditions without the need for multiple specialized models. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions, injecting this information into the image generation process through a cross-attention mechanism enhanced by Rotary Position Embedding. Experimental results across a variety of tasks, including pixel-level spatial control, subject-driven image generation, and style-image-based image synthesis, demonstrate the effectiveness of our UNIC-Adapter in unified controllable image generation.
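
A minimal sketch of the injection idea, assuming a standard cross-attention layer (the paper's Rotary Position Embedding on queries and keys and the exact MM-DiT integration are omitted; all names here are illustrative):

```python
import torch
import torch.nn as nn

class ConditionCrossAttention(nn.Module):
    """Inject multi-modal instruction tokens into image-generation features.

    Image latents act as queries, while concatenated condition-image and
    task-instruction tokens act as keys/values.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, cond_image_tokens, instruction_tokens):
        # (B, N, C) image latents attend over the fused condition tokens.
        context = torch.cat([cond_image_tokens, instruction_tokens], dim=1)
        attended, _ = self.attn(self.norm(image_tokens), context, context)
        return image_tokens + attended  # residual injection
```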


MCVO: A Generic Visual Odometry for Arbitrarily Arranged Multi-Cameras

arXiv.org Artificial Intelligence

Making multi-camera visual SLAM systems easier to set up and more robust to the environment has long been a focus of vision-based robotics. Existing monocular and binocular SLAM systems have a narrow FoV and are fragile in textureless environments, suffering degraded accuracy and limited robustness. Multi-camera SLAM systems are therefore gaining attention, because their wide FoV provides redundancy against texture degeneration. However, current multi-camera SLAM systems face massive data-processing pressure and require elaborately designed camera configurations, leading to estimation failures for arbitrarily arranged multi-camera systems. To address these problems, we propose a generic visual odometry for arbitrarily arranged multi-cameras, which achieves metric-scale state estimation with high flexibility in the cameras' arrangement. Specifically, we first design a learning-based feature extraction and tracking framework to shift the pressure of processing multiple video streams away from the CPU. Then we use the rigid constraints between cameras to estimate metric-scale poses for robust SLAM initialization. Finally, we fuse the features of the multiple cameras in the SLAM back-end to achieve robust pose estimation and online scale optimization. Additionally, multi-camera features help improve loop detection for pose graph optimization. Experiments on the KITTI-360 and MultiCamData datasets validate the robustness of our method over arbitrarily placed cameras. Compared with other stereo and multi-camera visual SLAM systems, our method obtains higher pose estimation accuracy with better generalization ability. Our codes and online demos are available at https://github.com/JunhaoWang615/MCVO
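
One way to see why rigid inter-camera constraints yield metric scale is that a feature co-observed by two cameras with a calibrated extrinsic transform can be triangulated directly in metres. The sketch below illustrates this with plain DLT triangulation and is not MCVO's actual initialization code; all names are illustrative:

```python
import numpy as np

def triangulate_metric_point(K_a, K_b, T_ab, pix_a, pix_b):
    """Triangulate a feature seen by two rigidly mounted cameras.

    Because the extrinsic transform T_ab (camera A -> camera B, in metres) is
    known from calibration, the triangulated point is already metric, which is
    what anchors the scale of the multi-camera odometry.

    K_a, K_b     : (3, 3) intrinsics
    T_ab         : (4, 4) rigid transform taking points from frame A to frame B
    pix_a, pix_b : (2,) pixel observations of the same feature
    """
    # Projection matrices, with camera A as the reference frame.
    P_a = K_a @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_b = K_b @ T_ab[:3, :]

    # Standard linear (DLT) triangulation.
    A = np.stack([
        pix_a[0] * P_a[2] - P_a[0],
        pix_a[1] * P_a[2] - P_a[1],
        pix_b[0] * P_b[2] - P_b[0],
        pix_b[1] * P_b[2] - P_b[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # metric 3D point in camera A's frame
```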


QuadricsReg: Large-Scale Point Cloud Registration using Quadric Primitives

arXiv.org Artificial Intelligence

In the realm of large-scale point cloud registration, designing a compact symbolic representation is crucial for efficiently processing vast amounts of data and for ensuring registration robustness against significant viewpoint variations and occlusions. This paper introduces a novel point cloud registration method, QuadricsReg, which leverages concise quadric primitives to represent scenes and utilizes their geometric characteristics to establish correspondences for 6-DoF transformation estimation. As a symbolic feature, the quadric representation captures the primary geometric characteristics of scenes and can efficiently handle the complexity of large-scale point clouds. The intrinsic characteristics of quadrics, such as their types and scales, are employed to initialize correspondences. We then build a multi-level compatibility graph set and find correspondences via maximum-clique search over the geometric consistency between quadrics. Finally, we estimate the 6-DoF transformation from the quadric correspondences and further optimize it using a quadric degeneracy-aware distance in a factor graph, ensuring high registration accuracy and robustness against degenerate structures. We test on five public datasets and a self-collected heterogeneous dataset covering different LiDAR sensors and robot platforms. The exceptional registration success rates and minimal registration errors demonstrate the effectiveness of QuadricsReg in large-scale point cloud registration scenarios. Furthermore, real-world registration testing on our self-collected heterogeneous dataset shows the robustness and generalization ability of QuadricsReg across different LiDAR sensors and robot platforms. The codes and demos will be released at https://levenberg.github.io/QuadricsReg.
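
For readers unfamiliar with quadric primitives: a quadric can be written as a 4x4 symmetric matrix Q, with homogeneous points x on its surface satisfying x^T Q x = 0, and it transforms simply under a rigid motion. The sketch below (illustrative only, not the released implementation) shows the transformation rule and the algebraic point-to-quadric residual that geometric-consistency checks of this kind rely on:

```python
import numpy as np

def transform_quadric(Q, T):
    """Express a quadric in a new frame after applying rigid transform T to points.

    Q : (4, 4) symmetric matrix; homogeneous points x on the quadric satisfy x^T Q x = 0
    T : (4, 4) rigid transform mapping points from the old frame to the new one
    """
    T_inv = np.linalg.inv(T)
    return T_inv.T @ Q @ T_inv

def algebraic_residual(Q, points):
    """Algebraic point-to-quadric error for an (N, 3) array of points."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    return np.einsum('ni,ij,nj->n', homog, Q, homog)
```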


LaVIDE: A Language-Vision Discriminator for Detecting Changes in Satellite Image with Map References

arXiv.org Artificial Intelligence

Change detection, which typically relies on the comparison of bi-temporal images, is significantly hindered when only a single image is available. Comparing a single image with an existing map, such as OpenStreetMap, which is continuously updated through crowd-sourcing, offers a viable solution to this challenge. Unlike images, which carry low-level visual details of ground objects, maps convey high-level categorical information. This discrepancy in abstraction levels complicates the alignment and comparison of the two data types. In this paper, we propose a Language-VIsion Discriminator for dEtecting changes in satellite images with map references, namely LaVIDE, which leverages language to bridge the information gap between maps and images. Specifically, LaVIDE formulates change detection as the problem of "Does the pixel belong to [class]?", aligning maps and images within the feature space of the language-vision model to associate high-level map categories with low-level image details. Moreover, we build a mixture-of-experts discriminative module, which compares linguistic features from maps with visual features from images across various semantic perspectives, achieving comprehensive semantic comparison for change detection. Extensive evaluation on four benchmark datasets demonstrates that LaVIDE can effectively detect changes in satellite images with map references, outperforming state-of-the-art change detection algorithms, e.g., with gains of about 13.8% on the DynamicEarthNet dataset and 4.3% on the SECOND dataset.
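
The per-pixel question "Does the pixel belong to [class]?" can be pictured as comparing each pixel's visual embedding with the text embedding of the class recorded in the map. The sketch below is a schematic of that comparison only; the mixture-of-experts discriminative module is not reproduced, and the threshold and all names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def change_map(pixel_features, class_text_embeddings, map_labels, threshold=0.5):
    """Flag changed pixels by asking whether each pixel still matches its map class.

    pixel_features        : (H, W, C) per-pixel visual features from a language-vision model
    class_text_embeddings : (num_classes, C) text embeddings of the map categories
    map_labels            : (H, W) long tensor of class indices from the map reference
    """
    feats = F.normalize(pixel_features, dim=-1)
    texts = F.normalize(class_text_embeddings, dim=-1)

    # Similarity between every pixel and the text embedding of its *mapped* class.
    mapped_text = texts[map_labels]                # (H, W, C)
    similarity = (feats * mapped_text).sum(dim=-1)  # (H, W)

    # A pixel whose appearance no longer supports its map category is a change.
    return similarity < threshold
```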


TAS: Distilling Arbitrary Teacher and Student via a Hybrid Assistant

arXiv.org Artificial Intelligence

Most knowledge distillation (KD) methodologies predominantly focus on teacher-student pairs with similar architectures, such as both being convolutional neural networks (CNNs). However, the potential and flexibility of KD can be greatly improved by expanding it to novel Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred flexibly to a given student. The primary challenge in CAKD lies in the substantial feature gaps between heterogeneous models, originating from the distinction of their inherent inductive biases and module functions. To this end, we introduce an assistant model as a bridge to facilitate smooth feature knowledge transfer between heterogeneous teachers and students. More importantly, within our proposed design principle, the assistant model combines the advantages of cross-architecture inductive biases and module functions by merging convolution and attention modules derived from both the student and the teacher. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions in CAKD, hindering the effectiveness of the conventional pixel-wise mean squared error (MSE) loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing, thereby improving feature alignment in CAKD. Our proposed method is evaluated across homogeneous model pairs and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, achieving state-of-the-art performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code and models will be released. Knowledge Distillation (KD) (Hinton et al., 2015; Romero et al., 2015) has been demonstrated as a powerful method to transfer knowledge from a pre-trained and cumbersome teacher model to a compact and efficient student model. Compared to a model trained from scratch, the performance of a student model distilled by appropriate teachers usually improves significantly.
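
A minimal sketch of a spatial-agnostic InfoNCE alignment after spatial smoothing, assuming teacher and student features have already been projected to the same channel dimension (the pooling size, temperature, and function names are illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def smoothed_infonce(student_feat, teacher_feat, temperature=0.1, pool=4):
    """InfoNCE alignment between heterogeneous features after spatial smoothing.

    student_feat, teacher_feat : (B, C, H, W) features with matching C.
    """
    # Spatial smoothing reduces sensitivity to where activations sit spatially.
    s = F.adaptive_avg_pool2d(student_feat, pool).flatten(1)
    t = F.adaptive_avg_pool2d(teacher_feat, pool).flatten(1)

    s = F.normalize(s, dim=1)
    t = F.normalize(t, dim=1)

    # Each student sample should match its own teacher sample against the batch.
    logits = s @ t.T / temperature                       # (B, B)
    targets = torch.arange(len(s), device=s.device)
    return F.cross_entropy(logits, targets)
```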


DMTG: One-Shot Differentiable Multi-Task Grouping

arXiv.org Machine Learning

We aim to address Multi-Task Learning (MTL) with a large number of tasks by Multi-Task Grouping (MTG). Given N tasks, we propose to simultaneously identify the best task groups from the 2^N candidates and train the model weights in one shot, with the high-order task affinity fully exploited. This is distinct from pioneering methods that sequentially identify the groups and then train the model weights, where the group identification often relies on heuristics. As a result, our method not only improves training efficiency, but also mitigates the objective bias introduced by the sequential procedures, which can lead to a suboptimal solution. Specifically, we formulate MTG as a fully differentiable pruning problem on an adaptive network architecture determined by an underlying Categorical distribution. To categorize N tasks into K groups (represented by K encoder branches), we initially set up KN task heads, where each branch connects to all N task heads to exploit the high-order task affinity. Then we gradually prune the KN heads down to N by learning a relaxed differentiable Categorical distribution, ensuring that each task is exclusively and uniquely categorized into only one branch. Extensive experiments on the CelebA and Taskonomy datasets with detailed ablations show the promising performance and efficiency of our method. The codes are available at https://github.com/ethanygao/DMTG.
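
The relaxed differentiable Categorical assignment can be sketched with a Gumbel-Softmax over (task, branch) logits; annealing the temperature pushes each task toward exactly one branch, which corresponds to pruning the KN task heads down to N. This is an illustrative sketch under those assumptions, not the released DMTG code:

```python
import torch
import torch.nn.functional as F

class TaskToBranchAssignment(torch.nn.Module):
    """Relaxed Categorical assignment of N tasks to K encoder branches."""

    def __init__(self, num_tasks, num_branches):
        super().__init__()
        # One row of branch logits per task.
        self.logits = torch.nn.Parameter(torch.zeros(num_tasks, num_branches))

    def forward(self, temperature=1.0, hard=False):
        # (N, K): each row is a soft (or straight-through hard) one-hot
        # weighting over branches, kept differentiable by Gumbel-Softmax.
        return F.gumbel_softmax(self.logits, tau=temperature, hard=hard, dim=-1)
```

Task t's output would then be the assignment-weighted combination of its K branch-specific heads, with the temperature annealed so the weighting collapses to a single branch.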


Optimization-based Structural Pruning for Large Language Models without Back-Propagation

arXiv.org Machine Learning

Compared to the moderate size of neural network models, structural weight pruning on Large Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation and memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without expensive weight finetuning; however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method 1) works at the post-training phase and 2) eliminates back-propagation through the LLM per se during the optimization (i.e., it only requires forward passes of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method runs in 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU, and our pruned models outperform the state of the art w.r.t. perplexity. Codes will be released.
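
The back-propagation-free optimization can be illustrated with a score-function (REINFORCE-style) estimator: binary masks are sampled from the Bernoulli distribution, the pruned model's loss is evaluated with forward passes only, and the Bernoulli parameters are updated from the log-likelihood gradient. The sketch below is a generic illustration of this idea; the learning rate, sample count, and mean baseline are assumptions, and the paper's structural granularities and sparsity control are not reproduced:

```python
import torch

def policy_gradient_step(theta, forward_loss, lr=0.05, num_samples=4):
    """One forward-only update of Bernoulli keep-probabilities.

    theta        : (num_units,) tensor of keep-probabilities in (0, 1)
    forward_loss : callable(mask) -> scalar torch tensor, computed with
                   LLM forward passes only (no back-propagation needed)
    """
    losses, masks = [], []
    for _ in range(num_samples):
        mask = torch.bernoulli(theta)          # sample a binary pruning mask
        masks.append(mask)
        losses.append(forward_loss(mask))
    losses = torch.stack(losses)
    baseline = losses.mean()                   # simple variance reduction

    grad = torch.zeros_like(theta)
    for mask, loss in zip(masks, losses):
        # d/d theta of log p(mask | theta) for a Bernoulli distribution.
        score = (mask - theta) / (theta * (1 - theta) + 1e-8)
        grad += (loss - baseline) * score
    grad /= num_samples

    # Gradient step on the distribution parameters, kept inside (0, 1).
    return (theta - lr * grad).clamp(0.01, 0.99)
```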


Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

arXiv.org Machine Learning

We aim to exploit additional auxiliary labels from an independent (auxiliary) task to boost the performance of the primary task we focus on, while preserving the single-task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based, relying on loss weight/gradient manipulation, our method is architecture-based, with a flexible asymmetric structure for the primary and auxiliary tasks that produces different networks for training and inference. Specifically, starting from two single-task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links remain as the cross-task connections after convergence. These connections can be removed during primary-task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization toward an architecture with only single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. In this paper, we tackle the practical issue of auxiliary learning, which involves improving the performance of a specific task (i.e., the primary task) while incorporating additional auxiliary labels from different tasks (i.e., the auxiliary tasks). We aim to efficiently leverage these auxiliary labels to enhance the primary task's performance while maintaining a computational and parameter cost comparable to a single-task network when evaluating the primary task.
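
The asymmetric-connection idea can be pictured as gated cross-task links between two branches: both directions exist while searching, and if the auxiliary-to-primary gate is driven to zero, the primary branch can be evaluated alone at its single-task cost. The toy layer below is only a schematic of that structure; the paper's actual NAS parameterization and convergence guidance differ, and all names are illustrative:

```python
import torch
import torch.nn as nn

class CrossTaskLayer(nn.Module):
    """One layer of a two-branch network with gated cross-task connections."""

    def __init__(self, dim):
        super().__init__()
        self.primary = nn.Linear(dim, dim)
        self.auxiliary = nn.Linear(dim, dim)
        # Learnable connection strengths; the search would push gate_a2p to 0,
        # so the auxiliary branch can be dropped at primary-task inference.
        self.gate_p2a = nn.Parameter(torch.ones(1))  # primary -> auxiliary
        self.gate_a2p = nn.Parameter(torch.ones(1))  # auxiliary -> primary

    def forward(self, x_primary, x_auxiliary):
        p = torch.relu(self.primary(x_primary))
        a = torch.relu(self.auxiliary(x_auxiliary))
        return p + self.gate_a2p * a, a + self.gate_p2a * p
```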


QuadricsNet: Learning Concise Representation for Geometric Primitives in Point Clouds

arXiv.org Artificial Intelligence

This paper presents a novel framework to learn a concise geometric primitive representation for 3D point clouds. Different from representing each type of primitive individually, we focus on the challenging problem of how to achieve a concise and uniform representation robustly. We employ quadrics to represent diverse primitives with only 10 parameters and propose the first end-to-end learning-based framework, namely QuadricsNet, to parse quadrics in point clouds. The relationships between the quadrics' mathematical formulation and their geometric attributes, including type, scale, and pose, are insightfully integrated for effective supervision of QuadricsNet. Besides, a novel pattern-comprehensive dataset with quadrics segments and objects is collected for training and evaluation. Experiments demonstrate the effectiveness of our concise representation and the robustness of QuadricsNet. Our code is available at https://github.com/MichaelWu99-lab/QuadricsNet.
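
For concreteness, the 10 quadric parameters can be assembled into a 4x4 symmetric matrix and evaluated as an algebraic residual per point. The sketch below shows one common parameterization and is not necessarily the normalization used in QuadricsNet:

```python
import numpy as np

def quadric_matrix(q):
    """Build the 4x4 symmetric quadric matrix from its 10 free parameters.

    q = (a, b, c, d, e, f, g, h, i, j) parameterises the surface
        a*x^2 + b*y^2 + c*z^2 + 2d*xy + 2e*xz + 2f*yz + 2g*x + 2h*y + 2i*z + j = 0
    """
    a, b, c, d, e, f, g, h, i, j = q
    return np.array([
        [a, d, e, g],
        [d, b, f, h],
        [e, f, c, i],
        [g, h, i, j],
    ])

def residuals(q, points):
    """Algebraic residual x^T Q x for each 3D point in an (N, 3) array."""
    Q = quadric_matrix(q)
    homog = np.hstack([points, np.ones((len(points), 1))])
    return np.einsum('ni,ij,nj->n', homog, Q, homog)
```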