AITopics | flickr

WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Neural Information Processing Systems, Dec-27-2025, 14:38:00 GMT

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs).

artificial intelligence, natural language, proceedings, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language (0.42)

WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Neural Information Processing Systems, Oct-10-2025, 22:37:27 GMT

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs).

blip-2, caption, dataset, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Europe > Poland (0.04)

Industry: Information Technology (0.70)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.46)

Probing the Limits of Stylistic Alignment in Vision-Language Models

Farajidizaji, Asma, Gupta, Akash, Raina, Vatsal

arXiv.org Artificial Intelligence, Oct-1-2025

Vision-language models are increasingly used to generate image captions in specific styles, such as humorous or romantic. However, these transformer-based models often struggle with this subjective task in a zero-shot setting. While preference data can be used to align them toward a desired style, such data is expensive to acquire, limiting the ability to explore the models' full capabilities. This work addresses this limitation by studying the data efficiency of aligning small vision-language models to humorous and romantic styles. This approach helps to define the performance limits of these models and determine how little preference data is needed to achieve stylistic saturation, benchmarking their capabilities and limitations.

caption, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2509.25568

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Neural Information Processing Systems, May-27-2025, 22:05:46 GMT

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). Several VLMs (e.g., CLIP, BLIP-2) have achieved near-perfect performance on widely used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM finetuned on another one (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models due to high visual and linguistic similarity between the evaluation and finetuning datasets. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a novel cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs.
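
Retrieval benchmarks of this kind are commonly scored with Recall@K: the fraction of queries whose correct match appears among the top K ranked candidates. The abstract does not say which metric WikiDO reports, so the following is only a minimal sketch of the general idea. The function name `recall_at_k`, the toy score matrix, and the one-correct-candidate-per-query setup are assumptions made for illustration; real benchmarks can have several correct captions per image.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    Assumption: similarity[i][j] scores query i against candidate j, and
    the correct candidate for query i is candidate i (one match per query).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Hypothetical image-to-text scores: 3 image queries against 3 captions.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct caption is ranked first
    [0.8, 0.6, 0.2],  # query 1: correct caption is ranked second
    [0.5, 0.7, 0.1],  # query 2: correct caption is ranked last
]

for k in (1, 2, 3):
    print(k, recall_at_k(toy_scores, k))
```

Text-to-image retrieval can be scored the same way by transposing the score matrix so that captions become the queries.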

benchmark, cross-modal retrieval, vision-language model, (11 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.98)
Information Technology > Communications > Social Media (0.65)
Information Technology > Artificial Intelligence > Vision (0.63)

Nearest Neighbor Normalization Improves Multimodal Retrieval

Chowdhury, Neil, Wang, Franklin, Shenoy, Sumedh, Kiela, Douwe, Schwettmann, Sarah, Thrush, Tristan

arXiv.org Artificial Intelligence, Oct-31-2024

Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
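
The abstract does not spell out how the correction is computed, so the sketch below is an assumption rather than the authors' implementation. It illustrates one plausible training-free scheme in the spirit described: each retrieval candidate gets a bias equal to a scaled mean of its top-k similarities to queries in a reference database, and that bias is subtracted from its scores. The function name `nnn_correct`, the parameters `k` and `alpha`, and all numbers are hypothetical.

```python
def nnn_correct(scores, reference_scores, k=2, alpha=0.5):
    """Subtract a per-candidate bias estimated from a reference query set.

    scores[q][c]: similarity of test query q to candidate c.
    reference_scores[c][r]: similarity of candidate c to reference query r.
    Assumed form: bias_c = alpha * mean of the k largest reference scores.
    """
    biases = []
    for ref_row in reference_scores:
        top_k = sorted(ref_row, reverse=True)[:k]
        biases.append(alpha * sum(top_k) / len(top_k))
    return [[s - b for s, b in zip(row, biases)] for row in scores]


def best_candidate(row):
    """Index of the highest-scoring candidate for one query."""
    return max(range(len(row)), key=lambda c: row[c])


# Hypothetical case: candidate 0 is a "hub" that scores highly against
# every query, so it outranks the correct candidate 1 before correction.
scores = [[0.80, 0.75]]
reference_scores = [
    [0.9, 0.8, 0.7],  # candidate 0: high similarity to all reference queries
    [0.3, 0.2, 0.1],  # candidate 1: low similarity to reference queries
]

corrected = nnn_correct(scores, reference_scores)
print(best_candidate(scores[0]), best_candidate(corrected[0]))
```

No model weights change in this sketch; only the scores are adjusted, which is consistent with the abstract's statement that the reference database is used without any training on it.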

albef ft, beit-3 ft, retrieval, (17 more...)

arXiv.org Artificial Intelligence

2410.24114

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Singapore (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Case-Based Reasoning (0.61)

TinyGraph: Joint Feature and Node Condensation for Graph Neural Networks

Liu, Yezi, Shen, Yanning

arXiv.org Artificial Intelligence, Jul-10-2024

Training graph neural networks (GNNs) on large-scale graphs can be challenging due to the high computational expense caused by the massive number of nodes and high-dimensional nodal features. Existing graph condensation studies tackle this problem only by reducing the number of nodes in the graph. However, the resulting condensed graph data can still be cumbersome. Specifically, although the nodes of the Citeseer dataset are reduced to 0.9% (30 nodes) in training, the number of features is 3,703, severely exceeding the training sample magnitude. Faced with this challenge, we study the problem of joint condensation for both features and nodes in large-scale graphs. This task is challenging mainly because 1) the intertwined nature of the node features and the graph structure calls for the feature condensation solver to be structure-aware; and 2) it is difficult to keep useful information in the condensed graph. To address these challenges, we propose a novel framework, TinyGraph, to condense features and nodes simultaneously in graphs. Specifically, we cast the problem as matching the gradients of GNN weights trained on the condensed graph and the gradients obtained from training over the original graph, where the feature condensation is achieved by a trainable function. The condensed graph obtained by minimizing the matching loss along the training trajectory can hence retain critical information in the original graph. Extensive experiments were carried out to demonstrate the effectiveness of the proposed TinyGraph. For example, a GNN trained with TinyGraph retains 98.5% and 97.5% of the original test accuracy on the Cora and Citeseer datasets, respectively, while significantly reducing the number of nodes by 97.4% and 98.2%, and the number of features by 90.0% on both datasets.

condensation, graph, tinygraph, (16 more...)

arXiv.org Artificial Intelligence

2407.08064

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > California > Orange County > Irvine (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.70)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

How many views does your deep neural network use for prediction?

Kawano, Keisuke, Kutsuna, Takuro, Sano, Keisuke

arXiv.org Artificial Intelligence, Feb-1-2024

The generalization ability of Deep Neural Networks (DNNs) is still not fully understood, despite numerous theoretical and empirical analyses. Recently, Allen-Zhu & Li (2023) introduced the concept of multi-views to explain the generalization ability of DNNs, but their main target is ensemble or distilled models, and no method for estimating multi-views used in a prediction of a specific input is discussed. In this paper, we propose Minimal Sufficient Views (MSVs), which are similar to multi-views but can be efficiently computed for real images. MSVs are a set of minimal and distinct features in an input, each of which preserves a model's prediction for the input. We empirically show that there is a clear relationship between the number of MSVs and prediction accuracy across models, including convolutional and transformer models, suggesting that a multi-view-like perspective is also important for understanding the generalization ability of (non-ensemble or non-distilled) DNNs.

msv, prediction, validation, (16 more...)

arXiv.org Artificial Intelligence

2402.01095

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > Wisconsin > Dane County > Madison (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

On the Transferability of Visually Grounded PCFGs

Zhao, Yanpeng, Titov, Ivan

arXiv.org Artificial Intelligence, Oct-21-2023

There has been a significant surge of interest in visually grounded grammar induction in recent times. While a variety of models have been developed for the task and have demonstrated impressive performance, they have not been evaluated on text domains that are different from the training domain, so it is unclear if the improvements brought by visual groundings are transferable. Our study aims to fill this gap and assess the degree of transferability. We start by extending VC-PCFG (short for Visually-grounded Compound PCFG; Zhao and Titov, 2020) in such a way that it can transfer across text domains. We consider a zero-shot transfer learning setting where a model is trained on the source domain and is directly applied to target domains, without any further training. Our experimental results suggest that the benefits from using visual groundings transfer to text in a domain similar to the training domain but fail to transfer to remote domains. Further, we conduct data and result analysis; we find that the lexicon overlap between the source domain and the target domain is the most important factor in the transferability of VC-PCFG.

computational linguistic, enweb, target domain, (14 more...)

arXiv.org Artificial Intelligence

2310.14107

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.14)
North America > United States > California > Los Angeles County > Los Angeles (0.14)
Europe > Italy > Tuscany > Florence (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Which country is this picture from? New data and methods for DNN-based country recognition

Alamayreh, Omran, Dimitri, Giovanna Maria, Wang, Jun, Tondi, Benedetta, Barni, Mauro

arXiv.org Artificial Intelligence, Feb-17-2023

Recognizing the country where a picture has been taken has many potential applications, such as identification of fake news and prevention of disinformation campaigns. Previous works focused on the estimation of the geo-coordinates where a picture has been taken. Yet, recognizing in which country an image was taken could be more critical, from a semantic and forensic point of view, than estimating its spatial coordinates. In the above framework, this paper provides two contributions. First, we introduce the VIPPGeo dataset, containing 3.8 million geo-tagged images. Secondly, we used the dataset to train a model casting the country recognition problem as a classification problem. The experiments show that our model provides better results than the current state of the art. Notably, we found that asking the network to identify the country provides better results than estimating the geo-coordinates and then tracing them back to the country where the picture was taken.

artificial intelligence, dataset, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2209.02429

Country:

North America > United States (0.14)
Asia > Japan (0.05)
Europe > Italy (0.05)
(16 more...)

Genre: Research Report (0.82)

Industry:

Government (1.00)
Media > News (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Active Learning from the Web

Sato, Ryoma

arXiv.org Artificial Intelligence, Feb-10-2023

Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and iteratively selects data to be labeled so that the total number of required labels is minimized, keeping the model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most of the methods assume that a task-specific pool is given for free. In this paper, we advocate that such a task-specific pool is not always available and propose the use of a myriad of unlabelled data on the Web for the pool for which active learning is applied. As the pool is extremely large, it is likely that relevant data exist in the pool for many tasks, and we do not need to explicitly design and build the pool for each task. The challenge is that we cannot compute the acquisition scores of all data exhaustively due to the size of the pool. We propose an efficient method, Seafaring, to retrieve informative data in terms of active learning from the Web using a user-side information retrieval algorithm. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the existing pools in the literature for active learning. We confirm that our method performs better than existing approaches of using a small unlabelled pool.
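
The generic pool-based loop described at the start of this abstract can be sketched as follows. This is not Seafaring and does not touch Flickr: it uses a small in-memory pool, a one-dimensional threshold classifier, and uncertainty sampling (query the point closest to the current decision boundary) as the selection criterion. All of these choices, along with the names `fit_threshold`, `active_learn`, and `oracle`, are assumptions made purely for illustration.

```python
def fit_threshold(labeled):
    """Toy model: midpoint between the largest negative and smallest positive."""
    largest_negative = max(x for x, y in labeled if y == 0)
    smallest_positive = min(x for x, y in labeled if y == 1)
    return (largest_negative + smallest_positive) / 2


def active_learn(pool, oracle, labeled, budget):
    """Iteratively label the pool point the current model is least sure about."""
    pool = list(pool)
    labeled = list(labeled)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labeled)
        # Uncertainty sampling: pick the point nearest the decision boundary.
        query = min(pool, key=lambda x: abs(x - threshold))
        pool.remove(query)
        labeled.append((query, oracle(query)))  # ask the (costly) labeler
    return fit_threshold(labeled), labeled


def oracle(x):
    """Hypothetical labeler: the true class is 1 when x exceeds 62."""
    return int(x > 62)


pool = range(1, 100)              # unlabelled pool
seed = [(0, 0), (100, 1)]         # one labeled example per class
threshold, labeled = active_learn(pool, oracle, seed, budget=6)
print(threshold, len(labeled))
```

In this toy run, six queries are enough to place the threshold between 62 and 63, whereas labeling the pool exhaustively would take 99 queries. The abstract's point is that computing such acquisition scores over every item is infeasible when the pool is the Web, which is why a retrieval step is needed there.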

artificial intelligence, data mining, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2210.08205

Country:

North America > United States > Texas > Travis County > Austin (0.05)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
North America > United States > New York > New York County > New York City (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Information Technology > Services (0.52)

Technology:

Information Technology > Data Science > Data Mining > Web Mining (0.85)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
