VQAScore: Evaluating and improving vision-language generative models
Text-to-image/video models like Midjourney, Imagen3, Stable Diffusion, and Sora can generate aesthetic, photo-realistic visuals from natural language prompts. For example, given "Several giant woolly mammoths approach, treading through a snowy meadow…", Sora generates the following:

[Video: Sora's output for the woolly mammoth prompt.]

But how do we know whether these models generate what we desire? For example, if the prompt is "The brown dog chases the black dog around a tree", how can we tell whether the model shows the dogs "chasing around a tree" rather than "playing in a backyard"? More generally, how should we evaluate these generative models? While humans can easily judge whether a generated image aligns with a prompt, large-scale human evaluation is costly. To address this, we introduce a new evaluation metric (VQAScore) and benchmark dataset (GenAI-Bench) [Lin et al., ECCV 2024] for automated evaluation of text-to-visual generative models.
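At its core, VQAScore asks a visual-question-answering (VQA) model whether the image shows the text, and uses the probability the model assigns to the answer "Yes" as the alignment score. Below is a minimal sketch of this recipe; the off-the-shelf BLIP-2 checkpoint, the `vqascore` helper, and the file name `generated.png` are illustrative stand-ins, not the paper's exact implementation (which uses a CLIP-FlanT5 model).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Illustrative stand-in: any generative VQA model that exposes
# token probabilities can be scored the same way.
processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
model.eval()

def vqascore(image_path: str, text: str) -> float:
    """Return P("Yes") when the model is asked whether the image shows `text`."""
    image = Image.open(image_path).convert("RGB")
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,              # only the first answer token is needed
            output_scores=True,
            return_dict_in_generate=True,
        )
    first_token_logits = out.scores[0][0]  # vocabulary logits for step 1
    probs = torch.softmax(first_token_logits, dim=-1)
    yes_id = processor.tokenizer("Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()

# Hypothetical usage: higher scores mean better image-text alignment.
print(vqascore("generated.png", "The brown dog chases the black dog around a tree"))
```

Because the score comes from a generative model reading the full sentence, it can reward compositional details (who chases whom, around what) that bag-of-words embedding similarities tend to miss.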