Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation
Neural Information Processing Systems
Large language models have demonstrated impressive performance in many domains, and recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.
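The three-step decomposition described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in and not the VPGen implementation: `generate_objects_counts` is a toy rule-based parser where the abstract uses a finetuned LM, `generate_layout` places boxes in equal-width slots where the abstract uses the same LM, and `generate_image` only packages the conditioning that a layout-to-image model would receive. The names `Box`, `NUMBER_WORDS`, and `vpgen_sketch` are assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabulary for the toy parser (the abstract uses a finetuned LM).
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4}


@dataclass
class Box:
    """A labeled bounding box in normalized [0, 1] canvas coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def generate_objects_counts(prompt: str) -> dict:
    """Step 1 (toy stand-in): read '<number word> <noun>' pairs from the prompt."""
    words = prompt.lower().replace(",", " ").split()
    counts = {}
    for word, nxt in zip(words, words[1:]):
        if word in NUMBER_WORDS and nxt not in NUMBER_WORDS:
            counts[nxt] = counts.get(nxt, 0) + NUMBER_WORDS[word]
    return counts


def generate_layout(counts: dict) -> list:
    """Step 2 (toy stand-in): give each object instance an equal-width slot."""
    labels = [label for label, n in counts.items() for _ in range(n)]
    if not labels:
        return []
    width = 1.0 / len(labels)
    return [
        Box(label, i * width, 0.25, (i + 1) * width, 0.75)
        for i, label in enumerate(labels)
    ]


def generate_image(prompt: str, layout: list) -> dict:
    """Step 3 (placeholder): return the conditioning for a layout-to-image model."""
    return {"prompt": prompt, "layout": layout}


def vpgen_sketch(prompt: str) -> dict:
    """Chain the three steps; each intermediate result can be inspected."""
    counts = generate_objects_counts(prompt)
    layout = generate_layout(counts)
    return generate_image(prompt, layout)
```

Calling `vpgen_sketch("two dogs and a cat")` yields three boxes labeled `dogs`, `dogs`, and `cat`. Because the object counts and the layout exist as explicit intermediate values, they can be checked or edited before any image is produced, which is the kind of interpretability and spatial control the abstract attributes to the step-by-step design.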
Oct-9-2024, 21:27:12 GMT
- Technology:
- Information Technology
- Artificial Intelligence
- Machine Learning > Neural Networks (0.63)
- Natural Language (0.83)
- Vision (1.00)
- Software > Programming Languages (0.67)