Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning
–Neural Information Processing Systems
Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner.
Neural Information Processing Systems
Jun-15-2026, 21:40:49 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.92)
- Research Report
- Industry:
- Information Technology (0.92)
- Health & Medicine (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning (1.00)
- Machine Learning > Neural Networks (0.68)
- Natural Language > Text Processing (0.68)
- Information Technology > Artificial Intelligence