AITopics | Agrawal, Aishwarya

Measuring Progress in Fine-grained Vision-and-Language Understanding

Bugliarello, Emanuele, Sartran, Laurent, Agrawal, Aishwarya, Hendricks, Lisa Anne, Nematzadeh, Aida

arXiv.org Artificial Intelligence, May-12-2023

Fine-grained multimodal skills (e.g., understanding relationships and recognising verbs) require identifying and relating various entities across both image and text modalities. Vision-and-language models (VLMs) need such skills to robustly perform well on real-world vision-and-language (V&L) applications; e.g., a coarse-grained model tested on image retrieval to "find an image where something is on a sofa" might incorrectly return an image of a cat sitting below the sofa. As another example, in captioning, a model might incorrectly describe an image where "someone is selling a sweater" as "someone is buying a sweater," if it does not have a precise understanding of the two verbs.

First we consider: Which models perform well on fine-grained tasks? To answer this, we evaluate models from four different model families trained with different amounts of pretraining data, as well as recent architectures that leverage frozen large language models (LLMs). We observe that modelling innovations have more impact than simply scaling image captions from the Web. Furthermore, explicitly modelling localisation can improve performance, but it is crucial how it is done, and simply using localisation data is not enough. Our observations motivate our next question: How do data and losses impact fine-grained understanding? We focus our study on the best performing ...

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2305.07558

Country:

Asia > Middle East (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Agrawal, Aishwarya, Kajić, Ivana, Bugliarello, Emanuele, Davoodi, Elnaz, Gergely, Anita, Blunsom, Phil, Nematzadeh, Aida

arXiv.org Artificial Intelligence, Apr-1-2023

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
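The closing point about stringent metrics can be illustrated with the exact-match style of scoring commonly used for VQA. The following is a minimal sketch, not the evaluation code used in the paper: it assumes the widely cited simplified form of VQA accuracy, min(number of matching human answers / 3, 1), and the `normalize` helper is a hypothetical stand-in for real answer normalisation.

```python
from typing import List


def normalize(answer: str) -> str:
    """Hypothetical light normalisation: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(number of exact matches / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(matches / 3.0, 1.0)


# Toy example (made-up data): ten annotators all answered "couch".
humans = ["couch"] * 10
exact = vqa_accuracy("couch", humans)    # same string as the annotators
synonym = vqa_accuracy("sofa", humans)   # correct in meaning, different string
```

Because the comparison is a string match, the semantically correct answer "sofa" receives no credit here, which is the kind of penalty for correct responses the abstract refers to.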

accuracy, machine learning, question answering, (21 more...)

arXiv.org Artificial Intelligence

2205.12191

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)

MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting

Mañas, Oscar, Rodriguez, Pau, Ahmadi, Saba, Nematzadeh, Aida, Goyal, Yash, Agrawal, Aishwarya

arXiv.org Artificial Intelligence, Mar-14-2023

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.

machine learning, mapl, natural language, (16 more...)

arXiv.org Artificial Intelligence

2210.07179

Country:

North America > United States (0.28)
North America > Canada > Quebec (0.14)

Genre: Research Report (0.63)

Industry:

Transportation (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Ramakrishnan, Sainandan, Agrawal, Aishwarya, Lee, Stefan

Neural Information Processing Systems, Dec-31-2018

Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training -- e.g., overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the mutual information between the image and answer given the question, which we maximize explicitly to encourage visual grounding. Our approach is a model-agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.

machine learning, natural language, question answering, (18 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Sports > Tennis (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.72)

Generating Diverse Programs with Instruction Conditioned Reinforced Adversarial Learning

Agrawal, Aishwarya, Malinowski, Mateusz, Hill, Felix, Eslami, Ali, Vinyals, Oriol, Kulkarni, Tejas

arXiv.org Machine Learning, Dec-3-2018

Advances in Deep Reinforcement Learning have led to agents that perform well across a variety of sensory-motor domains. In this work, we study the setting in which an agent must learn to generate programs for diverse scenes conditioned on a given symbolic instruction. Final goals are specified to our agent via images of the scenes. A symbolic instruction consistent with the goal images is used as the conditioning input for our policies. Since a single instruction corresponds to a diverse set of different but still consistent end-goal images, the agent needs to learn to generate a distribution over programs given an instruction. We demonstrate that with simple changes to the reinforced adversarial learning [8] objective, we can learn instruction-conditioned policies to achieve the corresponding diverse set of goals. Most importantly, our agent's stochastic policy is shown to more accurately capture the diversity in the goal distribution than a fixed pixel-based reward function baseline. We demonstrate the efficacy of our approach on two domains: (1) drawing MNIST digits with a paint software conditioned on instructions and (2) constructing scenes in a 3D editor that satisfy a certain instruction.

deep learning, instruction, neural network, (18 more...)

arXiv.org Machine Learning

1812.00898

Country: North America > United States (0.14)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.33)

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, Kembhavi, Aniruddha

arXiv.org Artificial Intelligence, Jun-3-2018

A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from 'cheating' by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers. GVQA is built off an existing VQA model -- Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.
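The idea behind a changing-priors split can be sketched with a "blind" baseline that ignores the image and always predicts the most frequent training answer for each question type. This is an illustration only, not the VQA-CP construction procedure or the GVQA model; the function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def answer_priors(examples):
    """Map each question type to a Counter over its answers."""
    priors = defaultdict(Counter)
    for question_type, answer in examples:
        priors[question_type][answer] += 1
    return priors


def prior_only_accuracy(train, test):
    """Accuracy of predicting the most frequent training answer per question type."""
    priors = answer_priors(train)
    correct = 0
    for question_type, answer in test:
        if question_type in priors:
            guess = priors[question_type].most_common(1)[0][0]
            correct += int(guess == answer)
    return correct / len(test)


# Toy (question type, answer) pairs; made-up data for illustration.
train = [("what sport", "tennis")] * 8 + [("what sport", "skiing")] * 2
same_prior_test = [("what sport", "tennis")] * 4 + [("what sport", "skiing")]
changed_prior_test = [("what sport", "skiing")] * 4 + [("what sport", "tennis")]

same_prior_score = prior_only_accuracy(train, same_prior_test)
changed_prior_score = prior_only_accuracy(train, changed_prior_test)
```

On the toy data the blind baseline scores well when the test set shares the training prior and poorly when the prior changes, which is why a split of this kind exposes models that lean on answer priors instead of the image.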

deep learning, gvqa, neural network, (21 more...)

arXiv.org Artificial Intelligence

1712.00377

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Sports (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Measuring Machine Intelligence Through Visual Question Answering

Zitnick, C. Lawrence (Facebook AI Research) | Agrawal, Aishwarya (Virginia Institute of Technology) | Antol, Stanislaw (Virginia Institute of Technology) | Mitchell, Margaret (Microsoft Research) | Batra, Dhruv (Virginia Institute of Technology) | Parikh, Devi (Virginia Institute of Technology)

AI Magazine, Apr-13-2016

As machines have become more intelligent, there has been a renewed interest in methods for measuring their intelligence. A common approach is to propose tasks at which humans excel but which machines find difficult. However, an ideal task should also be easy to evaluate and not be easily gameable. We begin with a case study exploring the recently popular task of image captioning and its limitations as a task for measuring machine intelligence. An alternative and more promising task is Visual Question Answering, which tests a machine's ability to reason about language and vision. We describe a dataset, unprecedented in size, created for the task that contains over 760,000 human-generated questions about images. Using around 10 million human-generated answers, machines may be easily evaluated.

artificial intelligence, caption, natural language, (19 more...)

AI Magazine

Country: North America > United States > New York (0.14)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.73)
