Cross-Modal Distillation For Widely Differing Modalities
Zhao, Cairong, Jin, Yufeng, Song, Zifan, Chen, Haonan, Miao, Duoqian, Hu, Guosheng
–arXiv.org Artificial Intelligence
Abstract--Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that hard constrained losses, e.g., an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modal distillation. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach effectively achieves knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.

The rapid advancement of deep learning has revolutionized numerous fields by enabling the development of increasingly complex and powerful models. However, as model sizes continue to grow, the marginal benefits of scaling up diminish, prompting researchers to explore alternative strategies for improving performance. One such strategy is multi-modal learning, which leverages the complementary strengths of multiple data modalities--such as images, speech, and text--to enhance task performance. While multi-modal learning has shown promise in various applications, its practical adoption is often hindered by the high cost and complexity of acquiring and processing multi-modal data. This limitation raises a critical question: how can we effectively utilize multi-modal data during training when only uni-modal data is available during deployment? To address this challenge, we propose a novel framework for cross-modal knowledge distillation, which enables the transfer of knowledge from a strong modality (e.g., images) to a weak modality (e.g., speech) during training, even when only the weak modality is available during inference.
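The abstract names two soft constrained distillation losses and a quality-based sample-weighting module without giving formulas. The snippet below is a minimal PyTorch sketch of how such a combined objective is commonly instantiated; the cosine-similarity feature constraint, the temperature-softened KL classifier constraint, and the `quality_weights` vector are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of soft constrained cross-modal distillation (assumptions noted below).
import torch
import torch.nn.functional as F

def feature_level_loss(student_feat, teacher_feat):
    """Soft feature-level constraint: align feature directions via cosine
    similarity rather than forcing exact equality (as a hard l2 loss would).
    Returns one loss value per sample."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    return 1.0 - (s * t).sum(dim=-1)

def classifier_level_loss(student_logits, teacher_logits, tau=4.0):
    """Soft classifier-level constraint: KL divergence between temperature-
    softened class distributions (standard soft-label distillation).
    Returns one loss value per sample."""
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1) * tau * tau

def distillation_loss(student_feat, teacher_feat,
                      student_logits, teacher_logits,
                      quality_weights, alpha=1.0, beta=1.0):
    """Combine both soft constraints; `quality_weights` is a per-sample scalar
    (a stand-in for the paper's quality-based adaptive weights module) that
    re-weights each sample's contribution before averaging."""
    per_sample = (alpha * feature_level_loss(student_feat, teacher_feat)
                  + beta * classifier_level_loss(student_logits, teacher_logits))
    return (quality_weights * per_sample).mean()

# Example usage: a batch of 8 samples, 256-d features, 100 classes.
s_f, t_f = torch.randn(8, 256), torch.randn(8, 256)
s_l, t_l = torch.randn(8, 100), torch.randn(8, 100)
w = torch.ones(8)  # placeholder quality scores; the paper derives these from quantified data quality
loss = distillation_loss(s_f, t_f, s_l, t_l, w)
```

Both constraints are "soft" in the sense that they match distributions or directions rather than exact feature values, which is what lets the student tolerate the large domain gap between modalities instead of overfitting to the teacher.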
Oct-7-2025
- Country:
  - Asia > China
    - Shanghai > Shanghai (0.04)
    - Zhejiang Province > Hangzhou (0.04)
  - Europe > United Kingdom (0.04)
- Genre:
  - Research Report > New Finding (0.66)
- Technology: