Goto

Collaborating Authors

 relevant image


Appendix

Neural Information Processing Systems

Your goal is to label if an image matches a search query Images matching a query are called "relevant" You should make sure to label all the relevant images


Appendix

Neural Information Processing Systems

Your goal is to label if an image matches a search query Images matching a query are called "relevant" You should make sure to label all the relevant images


QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval

Kwak, Jaehyun, Inhar, Ramahdani Muhammad Izaaz, Yun, Se-Young, Lee, Sung-Ju

arXiv.org Artificial Intelligence

Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning-which treats the target image as positive and all other images in the batch as negatives-can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.


ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models

Yi, Jingwei, Yin, Junhao, Xu, Ju, Bao, Peng, Wang, Yongliang, Fan, Wei, Wang, Hao

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.


INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Vendrow, Edward, Pantazis, Omiros, Shepard, Alexander, Brostow, Gabriel, Jones, Kate E., Mac Aodha, Oisin, Beery, Sara, Van Horn, Grant

arXiv.org Artificial Intelligence

We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io


LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Asgarov, Ali, Rustamov, Samir

arXiv.org Artificial Intelligence

This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.


CLIPping the Limits: Finding the Sweet Spot for Relevant Images in Automated Driving Systems Perception Testing

Rigoll, Philipp, Adolph, Laurenz, Ries, Lennart, Sax, Eric

arXiv.org Artificial Intelligence

Perception systems, especially cameras, are the eyes of automated driving systems. Ensuring that they function reliably and robustly is therefore an important building block in the automation of vehicles. There are various approaches to test the perception of automated driving systems. Ultimately, however, it always comes down to the investigation of the behavior of perception systems under specific input data. Camera images are a crucial part of the input data. Image data sets are therefore collected for the testing of automated driving systems, but it is non-trivial to find specific images in these data sets. Thanks to recent developments in neural networks, there are now methods for sorting the images in a data set according to their similarity to a prompt in natural language. In order to further automate the provision of search results, we make a contribution by automating the threshold definition in these sorted results and returning only the images relevant to the prompt as a result. Our focus is on preventing false positives and false negatives equally. It is also important that our method is robust and in the case that our assumptions are not fulfilled, we provide a fallback solution.


Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage

Tourni, Isidora Chara, Guo, Lei, Hu, Hengchang, Halim, Edward, Ishwar, Prakash, Daryanto, Taufiq, Jalal, Mona, Chen, Boqi, Betke, Margrit, Zhafransyah, Fabian, Lai, Sha, Wijaya, Derry Tanti

arXiv.org Artificial Intelligence

News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \say{frames} in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information(article- and image-derived features) improves prediction of news frames over any single mode of information when the images are relevant to the frames of the headlines. We also observe that frame image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing.


Advancements in Content-Based Image Retrieval: A Comprehensive Survey of Relevance Feedback Techniques

Qazanfari, Hamed, AlyanNezhadi, Mohammad M., Khoshdaregi, Zohreh Nozari

arXiv.org Artificial Intelligence

Content-based image retrieval (CBIR) systems have emerged as crucial tools in the field of computer vision, allowing for image search based on visual content rather than relying solely on metadata. This survey paper presents a comprehensive overview of CBIR, emphasizing its role in object detection and its potential to identify and retrieve visually similar images based on content features. Challenges faced by CBIR systems, including the semantic gap and scalability, are discussed, along with potential solutions. It elaborates on the semantic gap, which arises from the disparity between low-level features and high-level semantic concepts, and explores approaches to bridge this gap. One notable solution is the integration of relevance feedback (RF), empowering users to provide feedback on retrieved images and refine search results iteratively. The survey encompasses long-term and short-term learning approaches that leverage RF for enhanced CBIR accuracy and relevance. These methods focus on weight optimization and the utilization of active learning algorithms to select samples for training classifiers. Furthermore, the paper investigates machine learning techniques and the utilization of deep learning and convolutional neural networks to enhance CBIR performance. This survey paper plays a significant role in advancing the understanding of CBIR and RF techniques. It guides researchers and practitioners in comprehending existing methodologies, challenges, and potential solutions while fostering knowledge dissemination and identifying research gaps. By addressing future research directions, it sets the stage for advancements in CBIR that will enhance retrieval accuracy, usability, and effectiveness in various application domains.


ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Zhang, Bo, Wang, Jian, Ma, Hui, Xu, Bo, Lin, Hongfei

arXiv.org Artificial Intelligence

Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.