AITopics | Ghosh, Dhruba

Ghosh, Dhruba

Why are Visually-Grounded Language Models Bad at Image Classification?

Zhang, Yuhui, Unell, Alyssa, Wang, Xiaohan, Ghosh, Dhruba, Su, Yuchang, Schmidt, Ludwig, Yeung-Levy, Serena

arXiv.org Artificial Intelligence, May-28-2024

Image classification is one of the most fundamental capabilities of machine vision intelligence. In this work, we revisit the image classification task using visually-grounded language models (VLMs) such as GPT-4V and LLaVA. We find that existing proprietary and public VLMs, despite often using CLIP as a vision encoder and having many more parameters, significantly underperform CLIP on standard image classification benchmarks like ImageNet. To understand the reason, we explore several hypotheses concerning the inference algorithms, training objectives, and data processing in VLMs. Our analysis reveals that the primary cause is data-related: critical information for image classification is encoded in the VLM's latent space but can only be effectively decoded with enough training data. Specifically, there is a strong correlation between the frequency of class exposure during VLM training and instruction-tuning and the VLM's performance in those classes; when trained with sufficient data, VLMs can match the accuracy of state-of-the-art classification models. Based on these findings, we enhance a VLM by integrating classification-focused datasets into its training, and demonstrate that the enhanced classification performance of the VLM transfers to its general capabilities, resulting in an improvement of 11.8% on the newly collected ImageWikiQA dataset.

large language model, machine learning, vlm

2405.18415

Country: North America > United States > California (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology > Software (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)

DataComp: In search of the next generation of multimodal datasets

Gadre, Samir Yitzhak, Ilharco, Gabriel, Fang, Alex, Hayase, Jonathan, Smyrnis, Georgios, Nguyen, Thao, Marten, Ryan, Wortsman, Mitchell, Ghosh, Dhruba, Zhang, Jieyu, Orgad, Eyal, Entezari, Rahim, Daras, Giannis, Pratt, Sarah, Ramanujan, Vivek, Bitton, Yonatan, Marathe, Kalyani, Mussmann, Stephen, Vencu, Richard, Cherti, Mehdi, Krishna, Ranjay, Koh, Pang Wei, Saukh, Olga, Ratner, Alexander, Song, Shuran, Hajishirzi, Hannaneh, Farhadi, Ali, Beaumont, Romain, Oh, Sewoong, Dimakis, Alex, Jitsev, Jenia, Carmon, Yair, Shankar, Vaishaal, Schmidt, Ludwig

arXiv.org Artificial Intelligence, Oct-20-2023

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.

large language model, machine learning, natural language

2304.14108

Country:

Asia (0.67)
Europe (0.67)
North America > United States > Illinois (0.14)
North America > United States > Texas (0.14)

Genre:

Research Report > New Finding (1.00)
Overview (0.92)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.92)
Health & Medicine (0.92)
Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Ghosh, Dhruba, Hajishirzi, Hanna, Schmidt, Ludwig

arXiv.org Artificial Intelligence, Oct-17-2023

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.
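As a rough illustration of the idea in this abstract (using an object detector's output to check whether a generated image matches the objects and counts named in a prompt), here is a minimal sketch. It is not the GenEval implementation: the `Detection` tuple format, the function names, and the 0.5 confidence threshold are all assumptions made for this example, and the detections are hard-coded toy values rather than real detector output.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical detection format: (class label, confidence score).
Detection = Tuple[str, float]


def count_objects(detections: List[Detection], min_score: float = 0.5) -> Counter:
    """Count detected objects per class, ignoring low-confidence detections."""
    return Counter(label for label, score in detections if score >= min_score)


def satisfies_prompt(
    detections: List[Detection],
    expected: Dict[str, int],
    min_score: float = 0.5,
) -> bool:
    """Return True if every expected class appears exactly the expected number of times."""
    counts = count_objects(detections, min_score)
    # Counter returns 0 for classes that were never detected.
    return all(counts[label] == n for label, n in expected.items())


# Toy detections for an image generated from a prompt such as "two dogs and a bench".
toy_detections = [("dog", 0.9), ("dog", 0.8), ("cat", 0.3), ("bench", 0.7)]

print(satisfies_prompt(toy_detections, {"dog": 2, "bench": 1}))
print(satisfies_prompt(toy_detections, {"dog": 2, "cat": 1}))
```

The second check fails because the only "cat" detection falls below the assumed confidence threshold, which is the kind of instance-level failure a holistic score would not surface.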

artificial intelligence, object-focused framework, text-to-image alignment

2310.11513

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Vision (1.00)

Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level

Zhong, Ruiqi, Ghosh, Dhruba, Klein, Dan, Steinhardt, Jacob

arXiv.org Artificial Intelligence, May-12-2021

Larger language models have higher accuracy on average, but are they better on every single instance (datapoint)? Some work suggests larger models have higher out-of-distribution robustness, while other work suggests they have lower accuracy on rare subgroups. To understand these differences, we investigate these models at the level of individual instances. However, one major challenge is that individual predictions are highly sensitive to noise in the randomness in training. We develop statistically rigorous methods to address this, and after accounting for pretraining and finetuning noise, we find that our BERT-Large is worse than BERT-Mini on at least 1-4% of instances across MNLI, SST-2, and QQP, compared to the overall accuracy improvement of 2-10%. We also find that finetuning noise increases with model size and that instance-level accuracy has momentum: improvement from BERT-Mini to BERT-Medium correlates with improvement from BERT-Medium to BERT-Large. Our findings suggest that instance-level predictions provide a rich source of information; we therefore recommend that researchers supplement model weights with model predictions.
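To make the instance-level comparison in this abstract concrete, the sketch below averages each instance's correctness over several training seeds and then counts the instances on which the smaller model does better by some margin. This is a naive illustration only, not the statistically rigorous method the authors describe, and such a simple threshold may well be distorted by the training noise the abstract warns about. The function names, the `margin` parameter, and the toy prediction data are assumptions made for this example.

```python
from typing import List


def instance_accuracy(correct_by_seed: List[List[bool]]) -> List[float]:
    """Mean correctness of each instance across training seeds.

    correct_by_seed[s][i] is True if the run with seed s got instance i right.
    """
    n_seeds = len(correct_by_seed)
    n_instances = len(correct_by_seed[0])
    return [
        sum(run[i] for run in correct_by_seed) / n_seeds
        for i in range(n_instances)
    ]


def fraction_worse(
    small: List[List[bool]],
    large: List[List[bool]],
    margin: float = 0.5,
) -> float:
    """Fraction of instances where the small model beats the large one by >= margin."""
    small_acc = instance_accuracy(small)
    large_acc = instance_accuracy(large)
    worse = sum(1 for s, l in zip(small_acc, large_acc) if s - l >= margin)
    return worse / len(small_acc)


# Toy data: two seeds per model, four instances.
small_runs = [[True, False, True, True], [True, False, True, False]]
large_runs = [[True, True, False, True], [True, True, False, True]]

small_acc = instance_accuracy(small_runs)
large_acc = instance_accuracy(large_runs)
print(sum(small_acc) / len(small_acc), sum(large_acc) / len(large_acc))
print(fraction_worse(small_runs, large_runs))
```

In the toy data the larger model has the higher average accuracy, yet it is consistently wrong on one instance that the smaller model gets right, which is the pattern the abstract describes.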

category, health & medicine, us government

2105.0602

Country:

Europe (1.00)
Asia (0.67)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre:

Research Report > New Finding (0.68)
Personal > Interview (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
