Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Neural Information Processing Systems
Large sparsely-activated models have obtained excellent performance in multiple domains. However, such models are typically trained on a single modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning. LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss. MoEs are a natural fit for a multimodal backbone, since expert layers can learn an appropriate partitioning of modalities. However, new challenges arise; in particular, training stability and balanced expert utilization, for which we propose an entropy-based regularization scheme. Across multiple scales, we demonstrate performance improvements over dense models of equivalent computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 77.9% zero-shot ImageNet accuracy (vs. …).
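The entropy-based regularization mentioned in the abstract can be sketched as two auxiliary terms on the router's gating distribution, computed separately for image and text tokens: a local term that encourages each token to route confidently, and a global term that encourages balanced expert utilization across the batch. The sketch below is illustrative, not the paper's exact formulation; the function names, the unthresholded global term, and the loss weights are assumptions.

```python
import jax
import jax.numpy as jnp


def entropy(p, axis=-1, eps=1e-9):
    """Shannon entropy of a categorical distribution along `axis`."""
    return -jnp.sum(p * jnp.log(p + eps), axis=axis)


def routing_entropy_losses(router_logits):
    """Entropy-based auxiliary losses for MoE router regularization (sketch).

    router_logits: [num_tokens, num_experts] gating logits for the tokens
    of a single modality (LIMoE regularizes images and text separately).

    Returns (local_loss, global_loss):
      - local_loss: mean per-token entropy of the gate distribution;
        minimizing it pushes each token to route confidently.
      - global_loss: negative entropy of the batch-averaged gate
        distribution; minimizing it spreads load evenly across experts.
    """
    gates = jax.nn.softmax(router_logits, axis=-1)   # [tokens, experts]
    local_loss = jnp.mean(entropy(gates, axis=-1))   # confident routing
    mean_gates = jnp.mean(gates, axis=0)             # [experts]
    global_loss = -entropy(mean_gates)               # balanced utilization
    return local_loss, global_loss


# Hypothetical usage: combine the per-modality terms with the contrastive
# objective. The weights below are placeholders, not values from the paper.
def total_aux_loss(image_logits, text_logits, w_local=0.01, w_global=0.01):
    losses = [routing_entropy_losses(x) for x in (image_logits, text_logits)]
    return sum(w_local * l + w_global * g for l, g in losses)
```

How the auxiliary terms are weighted and combined with the image-text contrastive loss follows the paper; the sketch omits those details.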