Cross-Modal Distillation For Widely Differing Modalities
Zhao, Cairong, Jin, Yufeng, Song, Zifan, Chen, Haonan, Miao, Duoqian, Hu, Guosheng
–arXiv.org Artificial Intelligence
Abstract--Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that hard constrained losses, e.g., an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modal distillation. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach effectively achieves knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.

The rapid advancement of deep learning has revolutionized numerous fields by enabling the development of increasingly complex and powerful models. However, as model sizes continue to grow, the marginal benefits of scaling up diminish, prompting researchers to explore alternative strategies for improving performance. One such strategy is multi-modal learning, which leverages the complementary strengths of multiple data modalities--such as images, speech, and text--to enhance task performance. While multi-modal learning has shown promise in various applications, its practical adoption is often hindered by the high cost and complexity of acquiring and processing multi-modal data. This limitation raises a critical question: how can we effectively utilize multi-modal data during training when only uni-modal data is available during deployment? To address this challenge, we propose a novel framework for cross-modal knowledge distillation, which enables the transfer of knowledge from a strong modality (e.g., images) to a weak modality (e.g., speech) during training, even when only the weak modality is available during inference.
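The abstract names two soft constrained distillation losses and a quality-based sample-weighting module without giving formulas. The snippet below is a minimal PyTorch sketch of how such a combined objective is commonly instantiated; the cosine-similarity feature constraint, the temperature-softened KL classifier constraint, and the `quality_weights` vector are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of soft constrained cross-modal distillation (assumptions noted below).
import torch
import torch.nn.functional as F

def feature_level_loss(student_feat, teacher_feat):
    """Soft feature-level constraint: align feature directions via cosine
    similarity rather than forcing exact equality (as a hard l2 loss would).
    Returns one loss value per sample."""
    s = F.normalize(student_feat, dim=-1)
    t = F.normalize(teacher_feat, dim=-1)
    return 1.0 - (s * t).sum(dim=-1)

def classifier_level_loss(student_logits, teacher_logits, tau=4.0):
    """Soft classifier-level constraint: KL divergence between temperature-
    softened class distributions (standard soft-label distillation).
    Returns one loss value per sample."""
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1) * tau * tau

def distillation_loss(student_feat, teacher_feat,
                      student_logits, teacher_logits,
                      quality_weights, alpha=1.0, beta=1.0):
    """Combine both soft constraints; `quality_weights` is a per-sample scalar
    (a stand-in for the paper's quality-based adaptive weights module) that
    re-weights each sample's contribution before averaging."""
    per_sample = (alpha * feature_level_loss(student_feat, teacher_feat)
                  + beta * classifier_level_loss(student_logits, teacher_logits))
    return (quality_weights * per_sample).mean()

# Example usage: a batch of 8 samples, 256-d features, 100 classes.
s_f, t_f = torch.randn(8, 256), torch.randn(8, 256)
s_l, t_l = torch.randn(8, 100), torch.randn(8, 100)
w = torch.ones(8)  # placeholder quality scores; the paper derives these from quantified data quality
loss = distillation_loss(s_f, t_f, s_l, t_l, w)
```

Both constraints are "soft" in the sense that they match distributions or directions rather than exact feature values, which is what lets the student tolerate the large domain gap between modalities instead of overfitting to the teacher.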
Oct-7-2025
- Country:
  - Asia > China
    - Shanghai > Shanghai (0.04)
    - Zhejiang Province > Hangzhou (0.04)
  - Europe > United Kingdom (0.04)
- Genre:
  - Research Report > New Finding (0.66)
- Technology: