Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Koukounas, Andreas, Mastrapas, Georgios, Günther, Michael, Wang, Bo, Martens, Scott, Mohr, Isabelle, Sturua, Saba, Akram, Mohammad Kalim, Martínez, Joan Fontanals, Ognawala, Saahil, Guzman, Susana, Werk, Maximilian, Wang, Nan, Xiao, Han

Jun-26-2024–arXiv.org Artificial Intelligence

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-size vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform specialized text models on text-only tasks. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
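To make the idea of contrastive alignment concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired embeddings, summed over a text-image task and a text-text task. It is a generic illustration and not necessarily the objective used for jina-clip-v1: the function names (`info_nce`, `multi_task_loss`), the temperature value of 0.07, and the unweighted sum of the two task losses are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: a[i] and b[i] are a matching pair,
    every other b[j] in the batch acts as a negative for a[i].
    The temperature default is an assumption for this sketch."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [
        [sum(x * y for x, y in zip(u, w)) / temperature for w in b]
        for u in a
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def multi_task_loss(text_emb, image_emb, query_emb, passage_emb, temperature=0.07):
    """Hypothetical multi-task objective: an unweighted sum of a
    text-image loss and a text-text loss (an assumption, not the paper's recipe)."""
    return (
        info_nce(text_emb, image_emb, temperature)
        + info_nce(query_emb, passage_emb, temperature)
    )
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are shuffled, which is the signal that pulls matching texts and images (or queries and passages) together in the shared space.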

Country:
- Asia > Singapore (0.04)
- Europe
  - Poland (0.04)
  - Germany > Berlin (0.04)

Genre:
- Research Report (0.84)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning (1.00)
  - Natural Language
    - Information Retrieval (0.74)
    - Text Processing (0.68)
