M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
Text-to-image models are known to struggle with generating images that align perfectly with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, in particular overlooking the difficulty of prompts containing multiple distinct instances of the same category, or they introduce metrics that correlate poorly with human evaluation. To address this, we introduce M$^{3}$T2IBench, a large-scale multi-category, multi-instance, multi-relation text-to-image benchmark. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach, a training-free post-editing method that improves image-text alignment across a broad range of diffusion models.

Text-to-Image (T2I) models have demonstrated impressive performance in generating high-quality, realistic images (Betker et al., 2023; Esser et al., 2024). Despite this success, T2I models continue to struggle with accurately interpreting and following user prompts: they may fail to generate objects with the correct number, attributes, or relationships (Li et al., 2024). Assessing the alignment between the text and the generated image has remained a longstanding challenge. There are generally three approaches to evaluating image-text alignment. The first uses pretrained image-text models to produce an overall alignment score: CLIP Score (Hessel et al., 2021) is a widely used metric, and VQAScore (Lin et al., 2024) is an improved alternative. However, these metrics have several limitations, including their inability to accurately reflect the true alignment between the image and the text (Li et al., 2024) and their failure to provide explainable evaluation results.

Figure 1: A failure case generated by Stable Diffusion 3.
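For concreteness, here is a minimal sketch of the first evaluation approach mentioned above: scoring image-text alignment with a pretrained CLIP model. The checkpoint name and the common max(0, sim) with a 2.5x rescaling convention are illustrative assumptions, not the exact setup used in any of the papers listed here.

```python
# Minimal sketch: CLIP-based image-text alignment score, using the Hugging Face
# `transformers` CLIP implementation. Checkpoint and rescaling are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

@torch.no_grad()
def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img_emb * txt_emb).sum(dim=-1).item()
    # CLIPScore is commonly reported as 2.5 * max(similarity, 0).
    return max(0.0, 2.5 * sim)

# score = clip_score(Image.open("generated.png"), "a red cube on a blue sphere")
```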
A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Wang, Andrew Z., Ge, Songwei, Karras, Tero, Liu, Ming-Yu, Balaji, Yogesh
Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches used to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto practice of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers and find that layer-normalized averaging across all layers significantly improves alignment with complex prompts. Most LLMs with this conditioning outperform the baseline T5 model, showing stronger performance on advanced visio-linguistic reasoning.
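A minimal sketch of the layer-averaging conditioning described above, assuming a Hugging Face decoder-only checkpoint. The model name, the parameter-free LayerNorm, and the per-token output format are illustrative assumptions rather than the authors' exact recipe.

```python
# Minimal sketch: text conditioning from a decoder-only LLM by averaging
# layer-normalized hidden states across all layers (instead of using only
# the last layer). Checkpoint choice and normalization details are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2.5-0.5B"  # any open decoder-only LLM; illustrative choice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

@torch.no_grad()
def prompt_embeddings(prompt: str) -> torch.Tensor:
    inputs = tokenizer(prompt, return_tensors="pt")
    hidden = model(**inputs).hidden_states       # tuple of (num_layers + 1) tensors, each [1, T, D]
    stacked = torch.stack(hidden, dim=0)         # [L+1, 1, T, D]
    # Layer-normalize each layer's token states, then average across layers.
    normed = F.layer_norm(stacked, stacked.shape[-1:])
    return normed.mean(dim=0).squeeze(0)         # [T, D] per-token conditioning sequence

# cond = prompt_embeddings("a photo of three corgis wearing red scarves")
```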
Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
Text-to-image models often struggle to generate images that precisely match textual prompts. Prior research has extensively studied the evaluation of image-text alignment in text-to-image generation. However, existing evaluations primarily focus on agreement with human assessments, neglecting other critical properties of a trustworthy evaluation framework. In this work, we first identify two key aspects that a reliable evaluation should address. We then empirically demonstrate that current mainstream evaluation frameworks fail to fully satisfy these properties across a diverse range of metrics and models. Finally, we propose recommendations for improving image-text alignment evaluation.
Multi-Modal Language Models as Text-to-Image Model Evaluators
Chen, Jiahui, Ross, Candace, Askari-Hemmat, Reyhane, Sinha, Koustuv, Hall, Melissa, Drozdzal, Michal, Romero-Soriano, Adriana
The steady improvement of text-to-image (T2I) generative models leads to the gradual obsolescence of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to evaluate T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores the generated images, and matches the T2I evaluations of existing static benchmarks with only a fraction of their prompts. Moreover, we show that MT2IE's prompt-generation consistency scores correlate more strongly with human judgment than scores previously introduced in the literature. MT2IE generates prompts that efficiently probe T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th as many prompts for evaluation.
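A rough sketch of how such an iterative evaluator-agent loop could be wired up, based only on the abstract above and not on the released MT2IE implementation. The function names and the mean aggregation at the end are hypothetical choices made for illustration.

```python
# Rough sketch of an MLLM-as-evaluator loop: the MLLM proposes a probe prompt,
# the T2I model under test renders it, and the MLLM scores prompt-image
# consistency; the next probe can be conditioned on the history so far.
# `propose_prompt`, `generate_image`, and `score_consistency` are hypothetical
# interfaces that the caller must supply.
from typing import Callable, List, Tuple
from PIL import Image

def evaluate_t2i_model(
    propose_prompt: Callable[[List[Tuple[str, float]]], str],   # MLLM: history -> next probe prompt
    generate_image: Callable[[str], Image.Image],                # T2I model under evaluation
    score_consistency: Callable[[str, Image.Image], float],      # MLLM judge: (prompt, image) -> [0, 1]
    num_rounds: int = 20,
) -> float:
    history: List[Tuple[str, float]] = []
    for _ in range(num_rounds):
        prompt = propose_prompt(history)        # adapt probes to past successes and failures
        image = generate_image(prompt)
        score = score_consistency(prompt, image)
        history.append((prompt, score))
    # Aggregate per-probe consistency into a single model-level score.
    return sum(score for _, score in history) / len(history)
```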
VQAScore: Evaluating and improving vision-language generative models
Text-to-image/video models like Midjourney, Imagen 3, Stable Diffusion, and Sora can generate aesthetic, photo-realistic visuals from natural language prompts; for example, given "Several giant woolly mammoths approach, treading through a snowy meadow…", Sora generates a corresponding video. But how do we know whether these models generate what we desire? For example, if the prompt is "The brown dog chases the black dog around a tree", how can we tell whether the model shows the dogs "chasing around a tree" rather than "playing in a backyard"? More generally, how should we evaluate these generative models? While humans can easily judge whether a generated image aligns with a prompt, large-scale human evaluation is costly. To address this, we introduce a new evaluation metric (VQAScore) and benchmark dataset (GenAI-Bench) [Lin et al., ECCV 2024] for the automated evaluation of text-to-visual generative models.
GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
Li, Baiqi, Lin, Zhiqiu, Pathak, Deepak, Li, Jiayao, Fei, Yixin, Wu, Kewen, Ling, Tiffany, Xia, Xide, Zhang, Pengchuan, Neubig, Graham, Ramanan, Deva
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
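The training-free, black-box improvement described above amounts to best-of-N re-ranking. Here is a minimal sketch, where `generate_image` and `score_fn` are hypothetical stand-ins for a diffusion sampler and an alignment metric such as VQAScore.

```python
# Minimal sketch: sample a few candidate images for the same prompt and keep
# the one with the highest alignment score. The interfaces are hypothetical.
from typing import Callable, List
from PIL import Image

def best_of_n(
    prompt: str,
    generate_image: Callable[[str], Image.Image],    # T2I model, treated as a black box
    score_fn: Callable[[Image.Image, str], float],   # e.g. VQAScore(image, prompt)
    n: int = 4,                                      # the abstract reports gains with 3 to 9 candidates
) -> Image.Image:
    candidates: List[Image.Image] = [generate_image(prompt) for _ in range(n)]
    scores = [score_fn(img, prompt) for img in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]
```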
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Lin, Zhiqiu, Pathak, Deepak, Li, Baiqi, Li, Jiayao, Xia, Xide, Neubig, Graham, Zhang, Pengchuan, Ramanan, Deva
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
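Below is a minimal sketch of the VQAScore recipe using an off-the-shelf VQA model from Hugging Face (BLIP-2 Flan-T5 here, not the authors' in-house CLIP-FlanT5). The checkpoint choice, the "Please answer yes or no." suffix, and the renormalization over only the yes/no tokens are assumptions of this sketch.

```python
# Minimal sketch of VQAScore with an off-the-shelf VQA model: the score is the
# probability the model assigns to answering "yes" to the question
# "Does this figure show '{text}'?". Checkpoint and prompt details are assumptions.
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

ckpt = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt)

@torch.no_grad()
def vqa_score(image: Image.Image, text: str) -> float:
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    gen = model.generate(**inputs, max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    logits = gen.scores[0][0]                                     # vocab logits for the first answer token
    yes_id = processor.tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = processor.tokenizer("no", add_special_tokens=False).input_ids[0]
    p_yes, p_no = logits[[yes_id, no_id]].softmax(dim=-1)         # renormalize over yes/no only
    return p_yes.item()

# alignment = vqa_score(Image.open("generated.png"),
#                       "the brown dog chases the black dog around a tree")
```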