ColMate: Contrastive Late Interaction and Masked Text for Multimodal Document Retrieval

Masry, Ahmed, Thakkar, Megh, Bechard, Patrice, Madhusudhan, Sathwik Tejaswi, Awal, Rabiul, Mishra, Shambhavi, Suresh, Akshay Kalkunte, Daruru, Srivatsava, Hoque, Enamul, Gella, Spandana, Scholak, Torsten, Rajeswar, Sai

Nov-4-2025–arXiv.org Artificial Intelligence

Retrieval-augmented generation has proven practical when models require specialized knowledge or access to the latest data. However, existing methods for multimodal document retrieval often replicate techniques developed for text-only retrieval, whether in how they encode documents, define training objectives, or compute similarity scores. To address these limitations, we present ColMate, a document retrieval model that bridges the gap between multimodal representation learning and document retrieval. ColMate utilizes a novel OCR-based pretraining objective, a self-supervised masked contrastive learning objective, and a late interaction scoring mechanism that is more relevant to multimodal document structures and visual characteristics. ColMate obtains a 3.61% improvement over existing retrieval models on the ViDoRe V2 benchmark, demonstrating stronger generalization to out-of-domain benchmarks.
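The abstract does not spell out ColMate's scoring function. As background only, the sketch below shows the standard ColBERT-style late interaction score (often called MaxSim) that late-interaction retrievers typically start from: each query token embedding is matched to its most similar document token or patch embedding, and the per-token maxima are summed. This is an assumption-laden illustration of the general technique, not ColMate's mechanism; the names maxsim_score and rank_documents and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Standard late interaction (MaxSim), assumed here as a baseline:
    for every query token, take the best cosine similarity over all
    document token/patch embeddings, then sum over query tokens."""
    q = [_normalize(v) for v in query_tokens]
    d = [_normalize(v) for v in doc_tokens]
    if not q or not d:
        return 0.0
    return sum(max(_dot(qv, dv) for dv in d) for qv in q)


def rank_documents(query_tokens: Sequence[Vector],
                   docs: Sequence[Sequence[Vector]]) -> List[int]:
    """Return document indices sorted from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_tokens, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional embeddings, purely illustrative.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[2.0, 0.0], [0.0, 3.0]]   # covers both query directions
doc_b = [[1.0, 0.0]]               # covers only the first direction
ranking = rank_documents(query, [doc_b, doc_a])
```

Because the score is computed per token rather than from a single pooled vector, a document that matches every part of the query (doc_a) outranks one that matches only part of it (doc_b), which is the property that makes late interaction attractive for documents with many visual patches.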

large language model, machine learning, natural language

Country:
- Europe > Switzerland (0.28)

Genre:
- Research Report (0.64)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks (0.35)
  - Natural Language
    - Large Language Model (0.50)
    - Chatbot (0.36)
