Progressive Compositionality In Text-to-Image Generative Models

Xu Han, Linghao Jin, Xiaofeng Liu, Paul Pu Liang

Oct-22-2024–arXiv.org Artificial Intelligence

Despite their impressive text-to-image (T2I) synthesis capabilities, diffusion models often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or by learning from caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? These contrastive image pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose E… Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks.

The rapid advancement of text-to-image generative models (Saharia et al., 2022; Ramesh et al., 2022) has revolutionized the field of image synthesis, driving significant progress in various applications such as image editing (Brooks et al., 2023; Zhang et al., 2024), video generation (Brooks et al., 2024), and medical imaging (Han et al., 2024a). Common issues include incorrect attribute binding, miscounting, and flawed object relationships, as shown in Figure 1. For example, when given the prompt "a red motorcycle and a yellow door", the model might incorrectly bind the colors to the objects, resulting in a yellow motorcycle. Recent progress focuses on optimizing the attention mechanism within diffusion models to better capture the semantic information conveyed by input text prompts (Agarwal et al., 2023; Chefer et al., 2023; Pandey et al., 2023).

