VeXKD: The Versatile Integration of Cross-Modal Fusion and Knowledge Distillation for 3D Perception
Recent advancements in 3D perception have led to a proliferation of network architectures, particularly those involving multi-modal fusion algorithms. While these fusion algorithms improve accuracy, their complexity often impedes real-time performance. This paper introduces VeXKD, an effective and Versatile framework that integrates Cross-Modal Fusion with Knowledge Distillation. VeXKD applies knowledge distillation exclusively to the Bird's Eye View (BEV) feature maps, enabling the transfer of cross-modal insights to single-modal students without additional inference time overhead. It avoids volatile components that can vary across 3D perception tasks and student modalities, thus improving versatility. The framework adopts a modality-general cross-modal fusion module to bridge the modality gap between the multi-modal teachers and single-modal students. Furthermore, leveraging byproducts generated during fusion, our BEV query guided mask generation network identifies crucial spatial locations across different BEV feature maps from different tasks and semantic levels in a data-driven manner, significantly enhancing the effectiveness of knowledge distillation. Extensive experiments on the nuScenes dataset demonstrate notable improvements, with up to 6.9%/4.2%
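As a rough illustration of the distillation step described above (applying knowledge distillation only to BEV feature maps, weighted by a spatial mask), here is a minimal sketch. The function name, tensor shapes, and the idea of passing the mask in as a precomputed input are assumptions for illustration; in VeXKD the mask would come from the BEV query guided mask generation network, which is not reproduced here.

```python
import torch
import torch.nn.functional as F


def masked_bev_distillation_loss(student_bev: torch.Tensor,
                                 teacher_bev: torch.Tensor,
                                 mask: torch.Tensor) -> torch.Tensor:
    """Mask-weighted feature distillation on BEV maps (illustrative sketch).

    student_bev, teacher_bev: (B, C, H, W) BEV feature maps from the
        single-modal student and multi-modal teacher.
    mask: (B, 1, H, W) spatial weights in [0, 1] highlighting important
        BEV locations (hypothetically produced by a mask generation network).
    """
    # Per-location feature error, averaged over the channel dimension.
    err = F.mse_loss(student_bev, teacher_bev, reduction="none").mean(dim=1, keepdim=True)
    # Emphasize masked locations and normalize by the total mask weight.
    return (err * mask).sum() / mask.sum().clamp(min=1e-6)
```

Because the loss acts only on intermediate BEV features, it adds no cost at inference time, which matches the abstract's claim of no additional inference-time overhead.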
Neural Information Processing Systems
Mar-27-2025, 12:26:44 GMT
- Genre:
- Research Report
- Experimental Study (0.93)
- New Finding (0.67)
- Industry:
- Information Technology (0.93)
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks (0.68)
- Natural Language (0.93)
- Representation & Reasoning > Information Fusion (1.00)
- Robots (0.93)
- Vision (1.00)
- Data Science > Data Integration (0.86)
- Sensing and Signal Processing > Image Processing (1.00)