VQAScore: Evaluating and improving vision-language generative models
Text-to-image/video models like Midjourney, Imagen3, Stable Diffusion, and Sora can generate aesthetic, photo-realistic visuals from natural language prompts. For example, given "Several giant woolly mammoths approach, treading through a snowy meadow…", Sora generates the following:

[Video: Sora's output for the woolly mammoth prompt.]

But how do we know whether these models generate what we desire? For example, if the prompt is "The brown dog chases the black dog around a tree", how can we tell whether the model shows the dogs "chasing around a tree" rather than "playing in a backyard"? More generally, how should we evaluate these generative models? While humans can easily judge whether a generated image aligns with a prompt, large-scale human evaluation is costly. To address this, we introduce a new evaluation metric (VQAScore) and benchmark dataset (GenAI-Bench) [Lin et al., ECCV 2024] for automated evaluation of text-to-visual generative models.
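At its core, VQAScore asks a visual-question-answering (VQA) model whether the image shows the text, and uses the probability the model assigns to the answer "Yes" as the alignment score. Below is a minimal sketch of this recipe; the off-the-shelf BLIP-2 checkpoint, the `vqascore` helper, and the file name `generated.png` are illustrative stand-ins, not the paper's exact implementation (which uses a CLIP-FlanT5 model).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForConditionalGeneration

# Illustrative stand-in: any generative VQA model that exposes
# token probabilities can be scored the same way.
processor = AutoProcessor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
model.eval()

def vqascore(image_path: str, text: str) -> float:
    """Return P("Yes") when the model is asked whether the image shows `text`."""
    image = Image.open(image_path).convert("RGB")
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=1,              # only the first answer token is needed
            output_scores=True,
            return_dict_in_generate=True,
        )
    first_token_logits = out.scores[0][0]  # vocabulary logits for step 1
    probs = torch.softmax(first_token_logits, dim=-1)
    yes_id = processor.tokenizer("Yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()

# Hypothetical usage: higher scores mean better image-text alignment.
print(vqascore("generated.png", "The brown dog chases the black dog around a tree"))
```

Because the score comes from a generative model reading the full sentence, it can reward compositional details (who chases whom, around what) that bag-of-words embedding similarities tend to miss.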