AITopics | relevant image

Collaborating Authors

relevant image

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Appendix

Neural Information Processing SystemsFeb-18-2026, 11:47:42 GMT

Your goal is to label if an image matches a search query Images matching a query are called "relevant" You should make sure to label all the relevant images

artificial intelligence, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe (0.04)
North America > Canada (0.04)

Genre: Research Report (0.46)

Industry:

Law (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Appendix

Neural Information Processing SystemsOct-10-2025, 19:42:02 GMT

Your goal is to label if an image matches a search query Images matching a query are called "relevant" You should make sure to label all the relevant images

behavior cooperative, dataset, query, (15 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe (0.04)
North America > Canada (0.04)

Genre: Research Report (0.46)

Industry:

Law (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

QuRe: Query-Relevant Retrieval through Hard Negative Sampling in Composed Image Retrieval

Kwak, Jaehyun, Inhar, Ramahdani Muhammad Izaaz, Yun, Se-Young, Lee, Sung-Ju

arXiv.org Artificial IntelligenceJul-17-2025

Composed Image Retrieval (CIR) retrieves relevant images based on a reference image and accompanying text describing desired modifications. However, existing CIR methods only focus on retrieving the target image and disregard the relevance of other images. This limitation arises because most methods employing contrastive learning-which treats the target image as positive and all other images in the batch as negatives-can inadvertently include false negatives. This may result in retrieving irrelevant images, reducing user satisfaction even when the target image is retrieved. To address this issue, we propose Query-Relevant Retrieval through Hard Negative Sampling (QuRe), which optimizes a reward model objective to reduce false negatives. Additionally, we introduce a hard negative sampling strategy that selects images positioned between two steep drops in relevance scores following the target image, to effectively filter false negatives. In order to evaluate CIR models on their alignment with human satisfaction, we create Human-Preference FashionIQ (HP-FashionIQ), a new dataset that explicitly captures user preferences beyond target retrieval. Extensive experiments demonstrate that QuRe achieves state-of-the-art performance on FashionIQ and CIRR datasets while exhibiting the strongest alignment with human preferences on the HP-FashionIQ dataset. The source code is available at https://github.com/jackwaky/QuRe.

artificial intelligence, machine learning, target image, (11 more...)

arXiv.org Artificial Intelligence

2507.12416

Genre: Research Report (0.82)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.77)

Add feedback

ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models

Yi, Jingwei, Yin, Junhao, Xu, Ju, Bao, Peng, Wang, Yongliang, Fan, Wei, Wang, Hao

arXiv.org Artificial IntelligenceJan-20-2025

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in understanding multimodal inputs and have been widely integrated into Retrieval-Augmented Generation (RAG) based conversational systems. While current VLM-powered chatbots can provide textual source references in their responses, they exhibit significant limitations in referencing contextually relevant images during conversations. In this paper, we introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context -- and systematically investigate VLMs' capability in this aspect. We conduct the first evaluation for contextual image referencing, comprising a dedicated testing dataset and evaluation metrics. Furthermore, we propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities through instruction fine-tuning on a large-scale, manually curated multimodal conversation dataset. Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs in contextual image referencing tasks. Our code is available at https://github.com/bytedance/ImageRef-VL.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.12418

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

INQUIRE: A Natural World Text-to-Image Retrieval Benchmark

Vendrow, Edward, Pantazis, Omiros, Shepard, Alexander, Brostow, Gabriel, Jones, Kate E., Mac Aodha, Oisin, Beery, Sara, Van Horn, Grant

arXiv.org Artificial IntelligenceNov-11-2024

We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries. INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries. These queries are paired with all relevant images comprehensively labeled within iNat24, comprising 33,000 total matches. Queries span categories such as species identification, context, behavior, and appearance, emphasizing tasks that require nuanced image understanding and domain expertise. Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals. Detailed evaluation of a range of recent multimodal models demonstrates that INQUIRE poses a significant challenge, with the best models failing to achieve an mAP@50 above 50%. In addition, we show that reranking with more powerful multimodal models can enhance retrieval performance, yet there remains a significant margin for improvement. By focusing on scientifically-motivated ecological challenges, INQUIRE aims to bridge the gap between AI capabilities and the needs of real-world scientific inquiry, encouraging the development of retrieval systems that can assist with accelerating ecological and biodiversity research. Our dataset and code are available at https://inquire-benchmark.github.io

dataset, query, retrieval, (15 more...)

arXiv.org Artificial Intelligence

2411.02537

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
North America > United States > California (0.04)
Europe (0.04)
North America > Canada (0.04)

Genre: Research Report > New Finding (0.45)

Industry:

Law (0.92)
Information Technology (0.67)
Government (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(2 more...)

Add feedback

LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task

Asgarov, Ali, Rustamov, Samir

arXiv.org Artificial IntelligenceAug-25-2024

This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.

dataset, query, relevant image, (15 more...)

arXiv.org Artificial Intelligence

2408.13909

Country: North America > United States (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(3 more...)

Add feedback

CLIPping the Limits: Finding the Sweet Spot for Relevant Images in Automated Driving Systems Perception Testing

Rigoll, Philipp, Adolph, Laurenz, Ries, Lennart, Sax, Eric

arXiv.org Artificial IntelligenceJul-12-2024

Perception systems, especially cameras, are the eyes of automated driving systems. Ensuring that they function reliably and robustly is therefore an important building block in the automation of vehicles. There are various approaches to test the perception of automated driving systems. Ultimately, however, it always comes down to the investigation of the behavior of perception systems under specific input data. Camera images are a crucial part of the input data. Image data sets are therefore collected for the testing of automated driving systems, but it is non-trivial to find specific images in these data sets. Thanks to recent developments in neural networks, there are now methods for sorting the images in a data set according to their similarity to a prompt in natural language. In order to further automate the provision of search results, we make a contribution by automating the threshold definition in these sorted results and returning only the images relevant to the prompt as a result. Our focus is on preventing false positives and false negatives equally. It is also important that our method is robust and in the case that our assumptions are not fulfilled, we provide a fallback solution.

cosine distance, gaussian distribution, threshold value, (15 more...)

arXiv.org Artificial Intelligence

2404.05309

Country:

Europe > Austria > Vienna (0.14)
Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.05)
North America > United States > California > San Diego County > La Jolla (0.04)
(5 more...)

Genre: Research Report (0.51)

Industry:

Transportation > Ground > Road (1.00)
Information Technology > Robotics & Automation (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (1.00)
(2 more...)

Add feedback

Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage

Tourni, Isidora Chara, Guo, Lei, Hu, Hengchang, Halim, Edward, Ishwar, Prakash, Daryanto, Taufiq, Jalal, Mona, Chen, Boqi, Betke, Margrit, Zhafransyah, Fabian, Lai, Sha, Wijaya, Derry Tanti

arXiv.org Artificial IntelligenceJun-24-2024

News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \say{frames} in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information(article- and image-derived features) improves prediction of news frames over any single mode of information when the images are relevant to the frames of the headlines. We also observe that frame image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing.

concreteness, gun violence, information, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2021.findings-emnlp.339

2406.17213

Country:

North America > United States > Massachusetts > Plymouth County > Brockton (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > New Brunswick > York County > Fredericton (0.04)
(15 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media > News (1.00)
Law > Criminal Law (1.00)
Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.68)
Information Technology > Sensing and Signal Processing > Image Processing (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

Advancements in Content-Based Image Retrieval: A Comprehensive Survey of Relevance Feedback Techniques

Qazanfari, Hamed, AlyanNezhadi, Mohammad M., Khoshdaregi, Zohreh Nozari

arXiv.org Artificial IntelligenceDec-13-2023

Content-based image retrieval (CBIR) systems have emerged as crucial tools in the field of computer vision, allowing for image search based on visual content rather than relying solely on metadata. This survey paper presents a comprehensive overview of CBIR, emphasizing its role in object detection and its potential to identify and retrieve visually similar images based on content features. Challenges faced by CBIR systems, including the semantic gap and scalability, are discussed, along with potential solutions. It elaborates on the semantic gap, which arises from the disparity between low-level features and high-level semantic concepts, and explores approaches to bridge this gap. One notable solution is the integration of relevance feedback (RF), empowering users to provide feedback on retrieved images and refine search results iteratively. The survey encompasses long-term and short-term learning approaches that leverage RF for enhanced CBIR accuracy and relevance. These methods focus on weight optimization and the utilization of active learning algorithms to select samples for training classifiers. Furthermore, the paper investigates machine learning techniques and the utilization of deep learning and convolutional neural networks to enhance CBIR performance. This survey paper plays a significant role in advancing the understanding of CBIR and RF techniques. It guides researchers and practitioners in comprehending existing methodologies, challenges, and potential solutions while fostering knowledge dissemination and identifying research gaps. By addressing future research directions, it sets the stage for advancements in CBIR that will enhance retrieval accuracy, usability, and effectiveness in various application domains.

image retrieval, relevance feedback, retrieval, (11 more...)

arXiv.org Artificial Intelligence

2312.10089

Country:

Asia > Middle East > Iran (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)

Genre:

Research Report (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Add feedback

ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation

Zhang, Bo, Wang, Jian, Ma, Hui, Xu, Bo, Lin, Hongfei

arXiv.org Artificial IntelligenceAug-2-2023

Image-grounded dialogue systems benefit greatly from integrating visual information, resulting in high-quality response generation. However, current models struggle to effectively utilize such information in zero-resource scenarios, mainly due to the disparity between image and text modalities. To overcome this challenge, we propose an innovative multimodal framework, called ZRIGF, which assimilates image-grounded information for dialogue generation in zero-resource situations. ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training. Contrastive pre-training includes a text-image matching module that maps images and texts into a unified encoded vector space, along with a text-assisted masked image modeling module that preserves pre-training visual features and fosters further multimodal feature alignment. Generative pre-training employs a multimodal fusion module and an information transfer module to produce insightful responses based on harmonized multimodal representations. Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses. Furthermore, we adopt a fully zero-resource scenario in the image-grounded dialogue dataset to demonstrate our framework's robust generalization capabilities in novel domains. The code is available at https://github.com/zhangbo-nlp/ZRIGF.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3581783.3611810

2308.004

Country:

North America > Canada > Ontario > National Capital Region > Ottawa (0.05)
Asia > China > Liaoning Province > Dalian (0.05)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.70)

Add feedback