CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

Long, Zijun; Ge, Xuri; McCreadie, Richard; Jose, Joemon

Apr-2-2024–arXiv.org Artificial Intelligence

Text-to-image retrieval finds the images relevant to a text query and underpins applications such as digital libraries, e-commerce, and multimedia databases. Although Multimodal Large Language Models (MLLMs) achieve state-of-the-art performance, they struggle with large-scale, diverse, and ambiguous real-world retrieval needs because of their computational cost and the injective embeddings they produce. This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework for fast and effective large-scale long-text to image retrieval. The first stage, Entity-based Ranking (ER), handles the ambiguity of long-text queries with a multiple-queries-to-multiple-targets paradigm and filters candidates for the next stage. The second stage, Summary-based Re-ranking (SR), refines these rankings using summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder, optimized for handling ambiguous user needs and for serving both stages, which also improves computational efficiency through vector-based similarity inference. On the AToMiC dataset, CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000 while reducing training and retrieval times by 68.75% and 99.79%, respectively. We will release our code to facilitate future research at https://github.com/longkukuhi/CFIR.
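
The coarse-to-fine flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the encoder is not modeled, the embeddings are hand-written toy vectors, and the function names (`entity_ranking`, `summary_reranking`) and the choice to aggregate entity scores with a maximum are assumptions made only for illustration. What it does show is one image index shared by both stages, with scoring done by plain vector similarity.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def entity_ranking(entity_vecs, index, k):
    """Coarse stage: score each image by its best match over all entity queries.

    Taking the maximum over entities is an assumption of this sketch.
    """
    scores = {
        image_id: max(dot(q, img_vec) for q in entity_vecs)
        for image_id, img_vec in index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def summary_reranking(summary_vec, candidates, index):
    """Fine stage: re-order the surviving candidates by similarity to the summary query."""
    return sorted(candidates, key=lambda i: dot(summary_vec, index[i]), reverse=True)


# Toy embeddings standing in for encoder output (hypothetical values).
# The same index is read by both stages.
index = {
    "img_a": normalize([1.0, 0.0, 0.0]),
    "img_b": normalize([0.0, 1.0, 0.0]),
    "img_c": normalize([1.0, 1.0, 0.0]),
    "img_d": normalize([0.0, 0.0, 1.0]),
}
entity_vecs = [normalize([1.0, 0.0, 0.0]), normalize([0.0, 1.0, 0.0])]
summary_vec = normalize([1.0, 1.0, 0.0])

candidates = entity_ranking(entity_vecs, index, k=3)  # coarse filter
ranking = summary_reranking(summary_vec, candidates, index)  # fine re-rank
print(ranking)
```

In this toy setup the coarse stage drops `img_d`, which matches no entity, and the fine stage moves `img_c` to the top because it is closest to the summary vector. Because image vectors are computed once and reused, query-time cost is limited to encoding the query and taking dot products.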
