Geodesic Multi-Modal Mixup for Robust Fine-Tuning

Jan-19-2025, 17:44:37 GMT–Neural Information Processing Systems

Pre-trained multi-modal models, such as CLIP, provide transferable embeddings and show promising results in diverse applications. However, the analysis of learned multi-modal embeddings is relatively unexplored, and embedding transferability can be improved. In this work, we observe that CLIP keeps the embeddings of its two modalities in separate subspaces, and we investigate this through the lens of uniformity-alignment to measure the quality of the learned representation. Both theoretically and empirically, we show that CLIP retains poor uniformity and alignment even after fine-tuning. This lack of alignment and uniformity might restrict the transferability and robustness of the embeddings.
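To make the uniformity-alignment lens concrete, here is a minimal sketch of the two quantities as they are commonly defined on the unit hypersphere: alignment as the mean squared distance between paired embeddings, and uniformity as the log of the mean Gaussian potential over all distinct pairs. These definitions, the temperature `t=2.0`, and the function names are assumptions for illustration; the abstract does not state the exact formulas the paper uses.

```python
import math

def l2_normalize(v):
    # Project a vector onto the unit hypersphere.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(image_embs, text_embs):
    # Mean squared distance between paired (image, text) embeddings.
    # Lower means matched pairs sit closer together.
    pairs = list(zip(image_embs, text_embs))
    total = sum(sq_dist(l2_normalize(a), l2_normalize(b)) for a, b in pairs)
    return total / len(pairs)

def uniformity(embs, t=2.0):
    # Log of the mean Gaussian potential over all distinct pairs.
    # Lower (more negative) means the points spread more evenly.
    z = [l2_normalize(e) for e in embs]
    vals = [
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(len(z))
        for j in range(i + 1, len(z))
    ]
    return math.log(sum(vals) / len(vals))

# Toy 2-D embeddings (hypothetical data, not from the paper).
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
```

In this toy setting, `uniformity(spread)` is lower than `uniformity(clustered)`, and `alignment` is zero when every image embedding coincides with its paired text embedding. Embeddings that occupy separate subspaces per modality would tend to score poorly on both measures.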

geodesic multi-modal mixup, representation, robust fine-tuning

Technology:
- Information Technology > Artificial Intelligence (0.40)