
Collaborating Authors: Ray, Arijit


SAT: Spatial Aptitude Training for Multimodal Language Models

arXiv.org Artificial Intelligence

Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative-object-position questions to more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: 23% on CVBench, 8% on the harder BLINK benchmark, and 18% on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at http://arijitray1993.github.io/SAT/.
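
At evaluation time, SAT-style questions reduce to multiple-choice QA over rendered frames. The following is a minimal sketch of what such an accuracy computation could look like; the JSON-lines schema and the `model.answer(...)` interface are hypothetical placeholders, not the released SAT API.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class SpatialQA:
    image_path: str     # rendered frame from the synthetic scene
    question: str       # e.g. a perspective-taking or egocentric-action question
    choices: List[str]  # candidate answers
    answer: str         # ground-truth choice

def load_split(path: str) -> List[SpatialQA]:
    """Load SAT-style QA pairs from a JSON-lines file (hypothetical schema)."""
    with open(path) as f:
        return [SpatialQA(**json.loads(line)) for line in f]

def multiple_choice_accuracy(model, examples: List[SpatialQA]) -> float:
    """Accuracy when the model must pick one of the listed choices."""
    correct = sum(
        model.answer(ex.image_path, ex.question, ex.choices).strip().lower()
        == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / max(len(examples), 1)
```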


BloomVQA: Assessing Hierarchical Multi-modal Comprehension

arXiv.org Artificial Intelligence

We propose a novel VQA dataset, based on picture stories designed for educating young children, that aims to facilitate comprehensive evaluation and characterization of vision-language models on comprehension tasks. Unlike current VQA datasets that often focus on fact-based memorization and simple reasoning tasks without principled scientific grounding, we collect data containing tasks reflecting different levels of comprehension and underlying cognitive processes, as laid out in Bloom's Taxonomy, a classic framework widely adopted in education research. The proposed BloomVQA dataset can be mapped to a hierarchical graph-based representation of visual stories, enabling automatic data augmentation and novel measures characterizing model consistency across the underlying taxonomy. We demonstrate graded evaluation and reliability analysis based on our proposed consistency metrics on state-of-the-art vision-language models. Our results suggest that, while current models achieve the most gain on low-level comprehension tasks, they generally fall short on high-level tasks requiring more advanced comprehension and cognitive skills, with a 38.0% drop in VQA accuracy observed between the lowest- and highest-level tasks. Furthermore, current models show consistency patterns misaligned with human comprehension in various scenarios, suggesting emergent structures in model behavior.
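
One simple way to operationalize the graded evaluation is to aggregate accuracy per Bloom level and check whether correctness at a higher level is accompanied by correctness at the levels beneath it. The sketch below assumes a results dictionary keyed by (story, level); this grouping and the consistency rule are illustrative simplifications, not the paper's formal metrics.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# results[(story_id, level)] -> per-question correctness for that story and level
Results = Dict[Tuple[str, int], List[bool]]

def accuracy_by_level(results: Results) -> Dict[int, float]:
    """VQA accuracy aggregated per Bloom's-Taxonomy level (1 = lowest)."""
    per_level = defaultdict(list)
    for (_, level), outcomes in results.items():
        per_level[level].extend(outcomes)
    return {lvl: sum(v) / len(v) for lvl, v in sorted(per_level.items())}

def downward_consistency(results: Results, high: int, low: int) -> float:
    """Among stories answered correctly at level `high`, the fraction that
    are also answered correctly at the lower level `low`."""
    hits, total = 0, 0
    for sid in {sid for (sid, _) in results}:
        if all(results.get((sid, high), [False])):
            total += 1
            hits += int(all(results.get((sid, low), [False])))
    return hits / total if total else 0.0
```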


Socratis: Are large multimodal models emotionally aware?

arXiv.org Artificial Intelligence

Existing emotion prediction benchmarks contain coarse emotion labels which do not consider the diversity of emotions that an image and text can elicit in humans due to various reasons. Learning diverse reactions to multimodal content is important as intelligent machines take a central role in generating and delivering content to society. To address this gap, we propose Socratis, a societal reactions benchmark, where each image-caption (IC) pair is annotated with multiple emotions and the reasons for feeling them. Socratis contains 18K free-form reactions for 980 emotions on 2075 image-caption pairs from 5 widely-read news and image-caption datasets. We benchmark the capability of state-of-the-art multimodal large language models to generate the reasons for feeling an emotion given an IC pair. Based on a preliminary human study, we observe that humans prefer human-written reasons more than twice as often as machine-generated ones. This suggests our task is harder than standard generation tasks; it stands in stark contrast to recent findings that humans cannot tell machine-written news articles apart from human-written ones, for instance. We further see that current captioning metrics based on large vision-language models also fail to correlate with human preferences. We hope that these findings and our benchmark will inspire further research on training emotionally aware models.
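
The two headline analyses reduce to a pairwise preference rate and a correlation between automatic metric scores and human judgments. A small sketch of both computations, with illustrative inputs rather than the released annotation schema:

```python
from statistics import correlation  # Pearson's r, available in Python 3.10+

def preference_rate(judgments):
    """judgments: annotator picks of 'human' or 'machine' when comparing a
    human-written reason against a machine-generated one for the same IC pair."""
    human = sum(1 for j in judgments if j == "human")
    machine = sum(1 for j in judgments if j == "machine")
    return human / machine if machine else float("inf")

def metric_human_agreement(metric_scores, human_scores):
    """Correlate an automatic captioning-style metric with human preference
    scores over the same set of generated reasons."""
    return correlation(metric_scores, human_scores)

# Toy example: humans preferring human-written reasons roughly 2x more often.
print(preference_rate(["human"] * 20 + ["machine"] * 9))  # ~2.2
```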


Language-Guided Audio-Visual Source Separation via Trimodal Consistency

arXiv.org Artificial Intelligence

We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object with its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment among the audio, visual, and natural language modalities. During inference, our approach can separate sounds given text, video, and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS, and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training.
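
To give a flavor of trimodal alignment, here is a generic CLIP-style consistency objective over audio, video, and text embeddings. This is only a sketch under the assumption of symmetric InfoNCE pairing; the paper's two pseudo-supervision losses operate on separation outputs and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_consistency(audio_emb, video_emb, text_emb):
    """Pull all three modalities toward a shared space by pairing them off
    (illustrative stand-in for the paper's alignment objectives)."""
    audio_emb, video_emb, text_emb = (F.normalize(x, dim=-1)
                                      for x in (audio_emb, video_emb, text_emb))
    return (clip_style_loss(audio_emb, text_emb) +
            clip_style_loss(video_emb, text_emb) +
            clip_style_loss(audio_emb, video_emb)) / 3

# Toy batch: 8 examples, 512-d embeddings from each modality encoder.
loss = trimodal_consistency(torch.randn(8, 512),
                            torch.randn(8, 512),
                            torch.randn(8, 512))
```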


Improving Users' Mental Model with Attention-directed Counterfactual Edits

arXiv.org Artificial Intelligence

In the domain of Visual Question Answering (VQA), studies have shown improvement in users' mental model of the VQA system when they are exposed to examples of how these systems answer certain Image-Question (IQ) pairs. In this work, we show that exposing users to controlled counterfactual image-question examples is more effective at improving their mental model than simply showing random examples. We compare a generative approach and a retrieval-based approach to showing counterfactual examples. We use recent advances in generative adversarial networks (GANs) to generate counterfactual images by deleting and inpainting certain regions of interest in the image. We then expose users to changes in the VQA system's answer on those altered images. To select the region of interest for inpainting, we experiment with both human-annotated attention maps and a fully automatic method that uses the VQA system's attention values. Finally, we test users' mental model by asking them to predict the model's performance on a test counterfactual image. We note an overall improvement in users' accuracy at predicting answer changes when shown counterfactual explanations. While realistic retrieved counterfactuals are, unsurprisingly, the most effective at improving the mental model, we show that a generative approach can be equally effective.
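
The automatic variant of region selection amounts to picking the image area under the attention peak and handing it to an inpainting model. A minimal sketch of that step, assuming a coarse attention grid and a fixed square patch (the patch-based mask is an illustrative simplification, not the paper's exact procedure):

```python
import numpy as np

def region_to_inpaint(attention: np.ndarray, image_hw, patch: int = 64):
    """Return a boolean mask over the image marking the patch under the
    attention peak; the mask is what an off-the-shelf inpainting model
    (not shown) would fill in with new content."""
    h, w = image_hw
    att = np.asarray(attention, dtype=float)
    # Upsample the coarse attention grid to image resolution (nearest neighbour).
    ys = np.arange(h) * att.shape[0] // h
    xs = np.arange(w) * att.shape[1] // w
    full = att[np.ix_(ys, xs)]
    cy, cx = np.unravel_index(np.argmax(full), full.shape)
    y0, x0 = max(cy - patch // 2, 0), max(cx - patch // 2, 0)
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y0 + patch, x0:x0 + patch] = True
    return mask
```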


Knowing What VQA Does Not: Pointing to Error-Inducing Regions to Improve Explanation Helpfulness

arXiv.org Artificial Intelligence

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we propose Error Maps, which clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly, leading to an incorrect answer, and hence improve users' understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users' interpretation of explanations to estimate their potential helpfulness for understanding model correctness. Finally, we conduct user studies showing that our new explanations help users understand model correctness better than baselines by an expected 30%, and that our proxy helpfulness metrics correlate strongly ($\rho > 0.97$) with how well users can predict model correctness.
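
The proxy metric can be read as: simulate a user who sees only the explanation, ask them to predict whether the model was right, and then check that this proxy tracks real user accuracy across explanation methods. A sketch under that reading, where `simulated_user` is a hypothetical decision rule rather than the paper's exact simulation:

```python
from scipy.stats import spearmanr

def proxy_helpfulness(explanations, model_correct, simulated_user):
    """Fraction of cases where a simulated user, reading only the explanation,
    correctly predicts whether the model's answer was right."""
    guesses = [simulated_user(e) for e in explanations]
    return sum(g == c for g, c in zip(guesses, model_correct)) / len(model_correct)

def metric_validity(per_method_proxy_scores, per_method_user_accuracy):
    """Rank-correlate the proxy metric with measured user accuracy across
    explanation methods (the paper reports rho > 0.97 for its metric)."""
    rho, p_value = spearmanr(per_method_proxy_scores, per_method_user_accuracy)
    return rho, p_value
```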


The Impact of Explanations on AI Competency Prediction in VQA

arXiv.org Artificial Intelligence

Explainability is one of the key elements for building trust in AI systems. Among numerous attempts to make AI explainable, quantifying the effect of explanations remains a challenge in conducting human-AI collaborative tasks. Aside from the ability to predict the overall behavior of AI, in many applications, users need to understand an AI agent's competency in different aspects of the task domain. In this paper, we evaluate the impact of explanations on the user's mental model of AI agent competency within the task of visual question answering (VQA). We quantify users' understanding of competency based on the correlation between the actual system performance and users' rankings. We introduce an explainable VQA system that uses spatial and object features and is powered by the BERT language model. Each group of users sees only one kind of explanation and uses it to rank the competencies of the VQA model. The proposed model is evaluated through between-subjects experiments to probe the explanations' impact on the user's perception of competency. A comparison between two VQA models shows that BERT-based explanations and the use of object features improve users' prediction of the model's competencies.
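
Quantifying "understanding of competency" as a correlation between measured per-competency accuracy and a user's ranking could look like the sketch below. The competency names and numbers are made up for illustration; the use of Spearman's rank correlation is an assumption, not necessarily the paper's exact statistic.

```python
from scipy.stats import spearmanr

# Hypothetical competencies probed in the study.
competencies = ["counting", "color", "spatial", "reading", "activity"]

def competency_understanding(actual_accuracy, user_ranking):
    """Correlate the VQA model's measured accuracy per competency with a
    user's ranking of those competencies (1 = judged strongest).
    Ranks are negated so a better rank aligns with higher accuracy."""
    rho, _ = spearmanr(actual_accuracy, [-r for r in user_ranking])
    return rho

# A user whose ranking perfectly matches the model's true strengths scores 1.0.
print(competency_understanding([0.81, 0.74, 0.52, 0.46, 0.63], [1, 2, 4, 5, 3]))
```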


Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation

arXiv.org Artificial Intelligence

While models for Visual Question Answering (VQA) have steadily improved over the years, interacting with one quickly reveals that these models lack consistency. For instance, if a model answers "red" to "What color is the balloon?", it might answer "no" if asked, "Is the balloon red?". These responses violate simple notions of entailment and raise questions about how effectively VQA models ground language. In this work, we introduce a dataset, ConVQA, and metrics that enable quantitative evaluation of consistency in VQA. For a given observable fact in an image (e.g., the balloon's color), we generate a set of logically consistent question-answer (QA) pairs (e.g., "Is the balloon red?") and also collect a human-annotated set of common-sense based consistent QA pairs (e.g., "Is the balloon the same color as tomato sauce?"). Further, we propose a consistency-improving data augmentation module, a Consistency Teacher Module (CTM). CTM automatically generates entailed (or similar-intent) questions for a source QA pair and fine-tunes the VQA model if the VQA's answer to the entailed question is consistent with the source QA pair. We demonstrate that our CTM-based training improves the consistency of VQA models on the ConVQA datasets and is a strong baseline for further research.
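
The CTM loop described above can be sketched as follows. The `question_generator.entailed`, `vqa_model.answer`, and `vqa_model.finetune_on` interfaces are placeholders standing in for the module's components, not the released code.

```python
def ctm_step(vqa_model, question_generator, image, src_question, src_answer):
    """One Consistency-Teacher-Module style step: generate entailed questions
    for a source QA pair and fine-tune on those whose answers stay consistent
    with the source pair."""
    consistent_pairs = []
    for ent_q, ent_a in question_generator.entailed(src_question, src_answer):
        prediction = vqa_model.answer(image, ent_q)
        if prediction == ent_a:  # answer consistent with the source QA pair
            consistent_pairs.append((image, ent_q, ent_a))
    if consistent_pairs:
        vqa_model.finetune_on(consistent_pairs)
    return consistent_pairs
```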


Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention

arXiv.org Artificial Intelligence

In this paper, we present a novel approach for the task of eXplainable Question Answering (XQA), i.e., generating natural language (NL) explanations for the Visual Question Answering (VQA) problem. We generate NL explanations comprising the evidence that supports the answer to a question asked about an image, using two sources of information: (a) annotations of entities in an image (e.g., object labels, region descriptions, relation phrases) generated from the scene graph of the image, and (b) the attention map generated by a VQA model when answering the question. We show how combining the visual attention map with the NL representation of relevant scene graph entities, carefully selected using a language model, can give reasonable textual explanations without the need for any additionally collected data (explanation captions, etc.). We run our algorithms on the Visual Genome (VG) dataset and conduct internal user studies to demonstrate the efficacy of our approach over a strong baseline. We have also released a live web demo showcasing our VQA and textual explanation generation using scene graphs and visual attention.
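
Combining the attention map with scene-graph entities can be pictured as ranking entities by the attention mass inside their boxes and keeping the top few phrases for the explanation. The sketch below omits the language-model selection step described in the abstract; the box format and scoring rule are illustrative assumptions.

```python
import numpy as np

def attended_entities(attention: np.ndarray, entities, image_hw, top_k=3):
    """Rank scene-graph entities by mean attention inside their boxes.
    `entities` is a list of (phrase, (x0, y0, x1, y1)) pairs in image pixels."""
    h, w = image_hw
    gy = attention.shape[0] / h  # grid cells per pixel, vertically
    gx = attention.shape[1] / w  # grid cells per pixel, horizontally
    scored = []
    for phrase, (x0, y0, x1, y1) in entities:
        region = attention[int(y0 * gy):max(int(y1 * gy), int(y0 * gy) + 1),
                           int(x0 * gx):max(int(x1 * gx), int(x0 * gx) + 1)]
        scored.append((float(region.mean()), phrase))
    return [phrase for _, phrase in sorted(scored, reverse=True)[:top_k]]
```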