open-ended prompt
Biden administration launches AI safety initiative, calling for public input on standards
The Biden administration said on Tuesday it was taking the first step toward writing key standards and guidance for the safe deployment of generative artificial intelligence and how to test and safeguard systems. The Commerce Department's National Institute of Standards and Technology (NIST) said it was seeking public input by Feb. 2 for conducting key testing crucial to ensuring the safety of AI systems. Commerce Secretary Gina Raimondo said the effort was prompted by President Joe Biden's October executive order on AI and aimed at developing "industry standards around AI safety, security, and trust that will enable America to continue leading the world in the responsible development and use of this rapidly evolving technology."
Visual Programming for Text-to-Image Generation and Evaluation
Jaemin Cho, Abhay Zala, Mohit Bansal
As large language models have demonstrated impressive performance in many domains, recent work has adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I work, which can handle only predefined object classes. We demonstrate that VPGen offers better control over the counts, spatial relations, and scales of objects than state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model, which is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and it also provides visual and textual explanations of the evaluation results. Our analysis shows that VPEval provides a more human-correlated evaluation for skill-specific and open-ended prompts than widely used single-model-based evaluation. We hope that our work encourages future progress on interpretable/explainable generation and evaluation for T2I models.
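The three-step decomposition described in the abstract (object/count generation, layout generation, image generation) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the names `Box`, `generate_objects`, `generate_layout`, and `generate_image` are hypothetical, the first two steps use simple rule-based stand-ins where the abstract describes a finetuned LM, and the final step only packages its inputs where a real layout-conditioned image generator would run.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical number vocabulary for the toy parser below.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4}


@dataclass
class Box:
    """A labeled bounding box in normalized [0, 1] image coordinates."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def generate_objects(prompt: str) -> Dict[str, int]:
    """Step 1 (stand-in): map a prompt to object names and counts.

    The abstract assigns this step to a finetuned LM; this toy version
    only recognizes "<number word> <noun>" pairs.
    """
    tokens = prompt.lower().replace(",", " ").split()
    counts: Dict[str, int] = {}
    for word, nxt in zip(tokens, tokens[1:]):
        if word in NUMBER_WORDS:
            n = NUMBER_WORDS[word]
            # Naive singularization: drop a trailing "s" for plural counts.
            noun = nxt[:-1] if n > 1 and nxt.endswith("s") else nxt
            counts[noun] = counts.get(noun, 0) + n
    return counts


def generate_layout(counts: Dict[str, int]) -> List[Box]:
    """Step 2 (stand-in): place one box per object instance.

    Also an LM step in the abstract; here each instance simply gets an
    equal-width horizontal slot.
    """
    labels = [label for label, n in counts.items() for _ in range(n)]
    if not labels:
        return []
    width = 1.0 / len(labels)
    return [
        Box(label, i * width, 0.25, (i + 1) * width, 0.75)
        for i, label in enumerate(labels)
    ]


def generate_image(prompt: str, layout: List[Box]) -> dict:
    """Step 3 (placeholder): a layout-conditioned image generator would
    run here; this stub just returns the conditioning inputs."""
    return {"prompt": prompt, "boxes": layout}


prompt = "two dogs and a cat"
counts = generate_objects(prompt)
layout = generate_layout(counts)
result = generate_image(prompt, layout)
print(counts)
print([box.label for box in layout])
```

Because the object list and the layout are explicit intermediate values, each step can be inspected or corrected before the image is generated, which is the kind of interpretability the step-by-step design is meant to provide.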