Visual Programming for Step-by-Step Text-to-Image Generation and Evaluation

Oct-9-2024, 21:27:12 GMT – Neural Information Processing Systems

As large language models have demonstrated impressive performance in many domains, recent work has adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM, finetuned on text-layout pairs, to handle the first two steps (object/count generation and layout generation). Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task.
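The three-step decomposition described above can be illustrated with a minimal sketch. This is not the VPGen implementation: every name here (`LayoutItem`, `generate_step_by_step`, and the `toy_*` functions) is hypothetical, and the toy stand-ins use simple string parsing and fixed-grid placement in place of the finetuned LM and the image generator, purely to show how the output of each step feeds the next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumed box format for this sketch: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class LayoutItem:
    name: str
    box: Box


def toy_object_counts(prompt: str) -> Dict[str, int]:
    """Step 1 stand-in (NOT an LM): read '<number-word> <noun>' pairs."""
    number_words = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}
    tokens = prompt.lower().replace(",", " ").split()
    counts: Dict[str, int] = {}
    for prev, word in zip(tokens, tokens[1:]):
        if prev in number_words:
            n = number_words[prev]
            # Naive singularization so "dogs" and "dog" share one key.
            noun = word[:-1] if n > 1 and word.endswith("s") else word
            counts[noun] = counts.get(noun, 0) + n
    return counts


def toy_layout(counts: Dict[str, int]) -> List[LayoutItem]:
    """Step 2 stand-in (NOT an LM): one equal-width column per object."""
    names = [name for name, n in counts.items() for _ in range(n)]
    if not names:
        return []
    width = 1.0 / len(names)
    return [
        LayoutItem(name, (i * width, 0.25, (i + 1) * width, 0.75))
        for i, name in enumerate(names)
    ]


def toy_render(prompt: str, layout: List[LayoutItem]) -> str:
    """Step 3 stand-in: a real system would call a layout-conditioned
    image generator here; this only returns a text summary."""
    return f"image for {prompt!r} with {len(layout)} placed regions"


def generate_step_by_step(
    prompt: str,
    count_fn: Callable[[str], Dict[str, int]],
    layout_fn: Callable[[Dict[str, int]], List[LayoutItem]],
    image_fn: Callable[[str, List[LayoutItem]], str],
):
    """Run the three steps in order and return every intermediate result."""
    counts = count_fn(prompt)         # step 1: objects and counts
    layout = layout_fn(counts)        # step 2: bounding-box layout
    image = image_fn(prompt, layout)  # step 3: image from prompt + layout
    return counts, layout, image


if __name__ == "__main__":
    counts, layout, image = generate_step_by_step(
        "two dogs and a cat", toy_object_counts, toy_layout, toy_render
    )
    print(counts)
    for item in layout:
        print(item)
    print(image)
```

Because the object counts and the layout are returned alongside the final image, each intermediate decision can be inspected on its own, which is the sense in which a step-by-step pipeline of this shape is easier to interpret than a single end-to-end call.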

layout generation, step-by-step text-to-image generation and evaluation, visual programming

Technology:
- Information Technology
  - Software > Programming Languages (0.67)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (0.83)
    - Machine Learning > Neural Networks (0.63)