Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval

Caffagni, Davide, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita

Mar-3-2025–arXiv.org Artificial Intelligence

Cross-modal retrieval is becoming more effective and attracting growing interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, on both the query and document sides. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on the M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
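
The layer-by-layer gated fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the cell's equations, so the gate formulas, the scalar weights (`w_forget`, `w_text`, `w_image`), and the function names `gated_step` and `fuse_layers` below are all assumptions made for illustration. The Transformer attention inside the actual cell is omitted, and raw per-layer feature vectors stand in for its outputs.

```python
import math


def sigmoid(x):
    """Logistic function, used for the LSTM-style gates."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_step(state, text_feat, image_feat,
               w_forget=1.0, w_text=1.0, w_image=1.0):
    """One recurrent step: merge one layer's text and image features into the state.

    All gate weights are hypothetical scalars; a real model would learn them
    (and would typically use full weight matrices rather than scalars).
    """
    new_state = []
    for h, t, v in zip(state, text_feat, image_feat):
        forget = sigmoid(w_forget * h)   # how much of the previous state to keep
        gate_t = sigmoid(w_text * t)     # how much textual signal to admit
        gate_v = sigmoid(w_image * v)    # how much visual signal to admit
        new_state.append(forget * h + gate_t * t + gate_v * v)
    return new_state


def fuse_layers(text_layers, image_layers, dim):
    """Run the cell over backbone layers, from the first layer to the last."""
    state = [0.0] * dim                  # assumed zero initial state
    for text_feat, image_feat in zip(text_layers, image_layers):
        state = gated_step(state, text_feat, image_feat)
    return state


if __name__ == "__main__":
    # Toy features: 3 backbone layers, 4-dimensional vectors per modality.
    text_layers = [[0.1, 0.2, 0.3, 0.4]] * 3
    image_layers = [[0.4, 0.3, 0.2, 0.1]] * 3
    print(fuse_layers(text_layers, image_layers, dim=4))
```

The point of the sketch is the recurrence: the state carried from one backbone layer to the next is what lets the final representation mix shallow and deep features from both modalities, with the sigmoid gates deciding how much of each source passes through.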

Country:
- Pacific Ocean > North Pacific Ocean
  - San Francisco Bay > Golden Gate (0.04)
- North America > United States
  - Nebraska (0.04)
  - Illinois (0.04)
  - California
    - San Francisco County > San Francisco (0.04)
    - Los Angeles County > Los Angeles
      - Hollywood (0.04)
- Europe
  - Switzerland > Bern
    - Bern (0.04)
  - Russia > Central Federal District
    - Moscow Oblast > Moscow (0.04)
  - Italy
    - Tuscany > Pisa Province
      - Pisa (0.04)
    - Emilia-Romagna > Modena Province
      - Modena (0.04)
- Asia
  - China (0.04)
  - Middle East > Syria (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)