Goto

Collaborating Authors

 flickr





Probing the Limits of Stylistic Alignment in Vision-Language Models

Farajidizaji, Asma, Gupta, Akash, Raina, Vatsal

arXiv.org Artificial Intelligence

Vision-language models are increasingly used to generate image captions in specific styles, such as humor or romantic. However, these transformer-based models often struggle with this subjective task in a zero-shot setting. While preference data can be used to align them toward a desired style, such data is expensive to acquire, limiting the ability to explore the models' full capabilities. This work addresses this by studying the data efficiency of aligning small vision-language models to humor and romantic styles. This approach helps to define the performance limits of these models and determine how little preference data is needed to achieve stylistic saturation, benchmarking their capabilities and limitations.


WikiDO: A New Benchmark Evaluating Cross-Modal Retrieval for Vision-Language Models

Neural Information Processing Systems

Cross-modal (image-to-text and text-to-image) retrieval is an established task used in evaluation benchmarks to test the performance of vision-language models (VLMs). CLIP, BLIP-2) have achieved near-perfect performance on widely-used image-text retrieval benchmarks such as MSCOCO-Test-5K and Flickr30K-Test-1K. As a measure of out-of-distribution (OOD) generalization, prior works rely on zero-shot performance evaluated on one dataset (Flickr) using a VLM finetuned on another one (MSCOCO). We argue that such comparisons are insufficient to assess the OOD generalization capability of models due to high visual and linguistic similarity between the evaluation and finetuning datasets. To address this gap, we introduce WikiDO (drawn from Wikipedia Diversity Observatory), a novel cross-modal retrieval benchmark to assess the OOD generalization capabilities of pretrained VLMs.


Nearest Neighbor Normalization Improves Multimodal Retrieval

Chowdhury, Neil, Wang, Franklin, Shenoy, Sumedh, Kiela, Douwe, Schwettmann, Sarah, Thrush, Tristan

arXiv.org Artificial Intelligence

Multimodal models leverage large-scale pre-training to achieve strong but still imperfect performance on tasks such as image captioning, visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP, BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.


TinyGraph: Joint Feature and Node Condensation for Graph Neural Networks

Liu, Yezi, Shen, Yanning

arXiv.org Artificial Intelligence

Training graph neural networks (GNNs) on large-scale graphs can be challenging due to the high computational expense caused by the massive number of nodes and high-dimensional nodal features. Existing graph condensation studies tackle this problem only by reducing the number of nodes in the graph. However, the resulting condensed graph data can still be cumbersome. Specifically, although the nodes of the Citeseer dataset are reduced to 0.9% (30 nodes) in training, the number of features is 3,703, severely exceeding the training sample magnitude. Faced with this challenge, we study the problem of joint condensation for both features and nodes in large-scale graphs. This task is challenging mainly due to 1) the intertwined nature of the node features and the graph structure calls for the feature condensation solver to be structure-aware; and 2) the difficulty of keeping useful information in the condensed graph. To address these challenges, we propose a novel framework TinyGraph, to condense features and nodes simultaneously in graphs. Specifically, we cast the problem as matching the gradients of GNN weights trained on the condensed graph and the gradients obtained from training over the original graph, where the feature condensation is achieved by a trainable function. The condensed graph obtained by minimizing the matching loss along the training trajectory can henceforth retain critical information in the original graph. Extensive experiments were carried out to demonstrate the effectiveness of the proposed TinyGraph. For example, a GNN trained with TinyGraph retains 98.5% and 97.5% of the original test accuracy on the Cora and Citeseer datasets, respectively, while significantly reducing the number of nodes by 97.4% and 98.2%, and the number of features by 90.0% on both datasets.


How many views does your deep neural network use for prediction?

Kawano, Keisuke, Kutsuna, Takuro, Sano, Keisuke

arXiv.org Artificial Intelligence

The generalization ability of Deep Neural Networks (DNNs) is still not fully understood, despite numerous theoretical and empirical analyses. Recently, Allen-Zhu & Li (2023) introduced the concept of multi-views to explain the generalization ability of DNNs, but their main target is ensemble or distilled models, and no method for estimating multi-views used in a prediction of a specific input is discussed. In this paper, we propose Minimal Sufficient Views (MSVs), which is similar to multi-views but can be efficiently computed for real images. MSVs is a set of minimal and distinct features in an input, each of which preserves a model's prediction for the input. We empirically show that there is a clear relationship between the number of MSVs and prediction accuracy across models, including convolutional and transformer models, suggesting that a multi-view like perspective is also important for understanding the generalization ability of (non-ensemble or non-distilled) DNNs.


On the Transferability of Visually Grounded PCFGs

Zhao, Yanpeng, Titov, Ivan

arXiv.org Artificial Intelligence

There has been a significant surge of interest in visually grounded grammar induction in recent times. While a variety of models have been developed for the task and have demonstrated impressive performance, they have not been evaluated on text domains that are different from the training domain, so it is unclear if the improvements brought by visual groundings are transferable. Our study aims to fill this gap and assess the degree of transferability. We start by extending VC-PCFG (short for Visually-grounded Compound PCFG~\citep{zhao-titov-2020-visually}) in such a way that it can transfer across text domains. We consider a zero-shot transfer learning setting where a model is trained on the source domain and is directly applied to target domains, without any further training. Our experimental results suggest that: the benefits from using visual groundings transfer to text in a domain similar to the training domain but fail to transfer to remote domains. Further, we conduct data and result analysis; we find that the lexicon overlap between the source domain and the target domain is the most important factor in the transferability of VC-PCFG.


Which country is this picture from? New data and methods for DNN-based country recognition

Alamayreh, Omran, Dimitri, Giovanna Maria, Wang, Jun, Tondi, Benedetta, Barni, Mauro

arXiv.org Artificial Intelligence

Recognizing the country where a picture has been taken has many potential applications, such as identification of fake news and prevention of disinformation campaigns. Previous works focused on the estimation of the geo-coordinates where a picture has been taken. Yet, recognizing in which country an image was taken could be more critical, from a semantic and forensic point of view, than estimating its spatial coordinates. In the above framework, this paper provides two contributions. First, we introduce the VIPPGeo dataset, containing 3.8 million geo-tagged images. Secondly, we used the dataset to train a model casting the country recognition problem as a classification problem. The experiments show that our model provides better results than the current state of the art. Notably, we found that asking the network to identify the country provides better results than estimating the geo-coordinates and then tracing them back to the country where the picture was taken.