Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
Neural Information Processing Systems
Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either require intensive fine-tuning of the entire T2I model or rely on users or large language models to specify generation layouts, which adds complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes, and its sub-objects all share the same cross-attention map.
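To make the core idea concrete, the sketch below collapses the embeddings of an attribute token and its object token into one composite token. It is only an illustration under stated assumptions: the abstract does not say how tokens are aggregated, so the element-wise sum used here, the function name `merge_tokens`, and the toy 2-dimensional embeddings are all hypothetical and are not the authors' implementation.

```python
from typing import List, Sequence

Vector = List[float]


def merge_tokens(embeddings: Sequence[Vector], group: Sequence[int]) -> List[Vector]:
    """Replace the tokens at positions `group` with one composite token.

    Assumption: the composite is the element-wise sum of the grouped
    embeddings, placed at the position of the first group member.
    """
    if not group:
        # Nothing to merge: return a copy of the input sequence.
        return [list(vec) for vec in embeddings]

    members = sorted(set(group))
    dim = len(embeddings[0])
    composite = [sum(embeddings[i][d] for i in members) for d in range(dim)]

    merged: List[Vector] = []
    for i, vec in enumerate(embeddings):
        if i == members[0]:
            merged.append(composite)   # composite takes the first member's slot
        elif i not in members:
            merged.append(list(vec))   # tokens outside the group are kept as-is
        # remaining group members are dropped
    return merged


# Toy embeddings for the prompt "a red car" (values are made up).
prompt = ["a", "red", "car"]
toy_embeddings = [[0.1, 0.0], [1.0, 0.5], [0.25, 2.0]]

# Merge the attribute "red" (index 1) with the object "car" (index 2).
merged = merge_tokens(toy_embeddings, group=[1, 2])
print(len(toy_embeddings), "tokens ->", len(merged), "tokens")
```

Because cross-attention produces one attention map per text token, a single composite token yields a single map for "red" and "car" together, which is the property the abstract describes.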
Mar-27-2025, 15:42:12 GMT
- Country:
- Europe (0.28)
- Genre:
- Research Report
- Experimental Study (0.93)
- Promising Solution (1.00)
- Industry:
- Information Technology (0.46)
- Technology: