Rare Text Semantics Were Always There in Your Diffusion Transformer

Jun-14-2026, 08:17:36 GMT – Neural Information Processing Systems

Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, which even advanced models still struggle to render, because the underlying concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models).

artificial intelligence, natural language, proceedings

Technology:
- Information Technology > Artificial Intelligence > Natural Language (0.77)