Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Neural Information Processing Systems 

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between next-token prediction training in LLMs and the requirement for discriminative prompt features in diffusion models.
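To make the setup concrete, below is a minimal sketch (not the paper's method) of the naive approach the abstract describes: taking a decoder-only LLM's last-layer hidden states as the conditioning sequence for a diffusion model, in place of CLIP/T5 embeddings. The checkpoint name and the projection step are illustrative assumptions.

```python
# Sketch of the naive "LLM as prompt encoder" setup the abstract describes.
# The checkpoint name is a placeholder, not the model used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LLM_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only LLM

tokenizer = AutoTokenizer.from_pretrained(LLM_NAME)
llm = AutoModelForCausalLM.from_pretrained(LLM_NAME, output_hidden_states=True)
llm.eval()

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Return per-token features from the LLM's last hidden layer.

    These features are optimized for next-token prediction, not for the
    discriminative text-image alignment diffusion models expect -- the
    mismatch the abstract identifies as the first obstacle.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = llm(**inputs)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim)
    return outputs.hidden_states[-1]

cond = encode_prompt("a corgi wearing a red scarf, watercolor style")
# `cond` would replace CLIP/T5 embeddings in the diffusion model's
# cross-attention layers, typically after a learned projection to the
# expected conditioning dimension.
```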