Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis

May-27-2025, 21:41:23 GMT–Neural Information Processing Systems

Although text-to-image (T2I) models exhibit remarkable generation capabilities,they frequently fail to accurately bind semantically related objects or attributesin the input prompts; a challenge termed semantic binding. Previous approacheseither involve intensive fine-tuning of the entire T2I model or require users orlarge language models to specify generation layouts, adding complexity. In thispaper, we define semantic binding as the task of associating a given object with itsattribute, termed attribute binding, or linking it to other related sub-objects, referredto as object binding. We introduce a novel method called Token Merging (ToMe),which enhances semantic binding by aggregating relevant tokens into a singlecomposite token. This ensures that the object, its attributes and sub-objects all sharethe same cross-attention map.

artificial intelligence, token merging, training-free semantic binding, (4 more...)

Neural Information Processing Systems

May-27-2025, 21:41:23 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Vision (0.85)