StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization
Gaur, Gopalji, Zolfaghari, Mohammadreza, Brox, Thomas
Generating a coherent sequence of images that tells a visual story with text-to-image diffusion models faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining, are computationally expensive and time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we propose an efficient, training-free method for consistent subject generation. It works seamlessly with pre-trained diffusion models by introducing Masked Cross-Image Attention Sharing, which dynamically aligns subject features across a batch of images, and Regional Feature Harmonization, which refines visually similar details for improved subject consistency. Experimental results demonstrate that our approach generates visually consistent subjects across a variety of scenarios while preserving the creative abilities of the diffusion model.
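The abstract does not spell out how the attention sharing works; below is a minimal single-head sketch of one plausible reading, in which each image's tokens additionally attend to a bank of subject tokens gathered, via masks, from every image in the batch. The function name, mask source, and sharing rule are assumptions for illustration, not the paper's actual implementation.

```python
import torch

def masked_cross_image_attention(q, k, v, subject_mask):
    """One plausible reading of masked cross-image attention sharing.

    q, k, v: (B, T, D) per-image attention tensors (single head, for brevity).
    subject_mask: (B, T) boolean mask marking subject tokens in each image.
    Every image attends to its own tokens plus the subject tokens of the
    whole batch, pulling subject features toward a shared appearance.
    """
    B, T, D = k.shape
    subj_k = k[subject_mask]                     # (S, D): subject tokens from all images
    subj_v = v[subject_mask]
    outs = []
    for b in range(B):
        k_b = torch.cat([k[b], subj_k], dim=0)   # (T + S, D)
        v_b = torch.cat([v[b], subj_v], dim=0)
        attn = torch.softmax(q[b] @ k_b.T / D ** 0.5, dim=-1)
        outs.append(attn @ v_b)                  # (T, D)
    return torch.stack(outs)                     # (B, T, D)

# Toy usage: 4 images, 16 tokens each, ~30% marked as subject tokens.
q = k = v = torch.randn(4, 16, 32)
out = masked_cross_image_attention(q, k, v, torch.rand(4, 16) > 0.7)
```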
Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models
Chen, Lifeng, Wang, Jiner, Pan, Zihao, Zhu, Beier, Yang, Xiaofeng, Zhang, Chi
Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.
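The staged generation can be pictured as scheduling one simplified sub-prompt per denoising phase. The sketch below shows only that scheduling idea with a stand-in text encoder and a toy denoiser; `embed`, `denoise_step`, and the update rule are placeholders, not Detail++'s actual components.

```python
import torch

# Illustrative stand-ins: Detail++'s real encoder and denoiser would be a
# text encoder and a diffusion U-Net/DiT; these toys only demonstrate the
# staged scheduling of sub-prompts.
def embed(prompt: str, dim: int = 8) -> torch.Tensor:
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.randn(dim, generator=g)

def denoise_step(x, cond, t):
    return x - 0.05 * (x - cond)  # toy update nudging the latent toward cond

def progressive_detail_injection(sub_prompts, steps=48, dim=8):
    """Run denoising in stages, one simplified sub-prompt per stage:
    early stages fix the global composition, later stages inject details."""
    x = torch.randn(dim)
    stage = steps // len(sub_prompts)
    for i, prompt in enumerate(sub_prompts):
        cond = embed(prompt, dim)
        for t in range(stage):
            x = denoise_step(x, cond, i * stage + t)
    return x

latent = progressive_detail_injection([
    "two people in a park",                       # composition only
    "a man in a red coat, a woman, in a park",    # bind first attribute
    "a man in a red coat, a woman with a blue umbrella, in a park",
])
```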
VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models
Feng, Kailai, Zhang, Yabo, Yu, Haodong, Ji, Zhilong, Bai, Jinfeng, Zhang, Hongzhi, Zuo, Wangmeng
Artistic typography is a technique for visualizing the meaning of an input character in an imaginative and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of the input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce VitaGlyph, a dual-branch, training-free method that enables flexible artistic typography with controllable geometry changes to maintain readability. The key insight of VitaGlyph is to treat the input character as a scene composed of a Subject and a Surrounding, and to render them under varying degrees of geometric transformation. The subject flexibly expresses the essential concept of the input character, while the surrounding enriches the relevant background without altering the shape. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions of the subject and surrounding. (ii) Regional Decomposition detects the part that best matches the subject description and divides the input glyph image into subject and surrounding regions. (iii) Typography Stylization first refines the structure of the subject region via Semantic Typography, and then separately renders the textures of the Subject and Surrounding regions through Controllable Compositional Generation. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability, but also depicts multiple customized concepts, facilitating more creative and pleasing artistic typography. Our code will be made publicly available at https://github.com/Carlofkl/VitaGlyph.
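The final compositing in phase (iii) can be sketched as masked blending of the two branch renders. The function below is an illustrative guess at that step, assuming a binary subject mask from Regional Decomposition; the blur-based feathering and all names are assumptions, not VitaGlyph's code.

```python
import torch
import torch.nn.functional as F

def composite_dual_branch(subject_img, surround_img, subject_mask, blur=7):
    """Blend separately rendered Subject and Surrounding branches.

    subject_img, surround_img: (3, H, W) renders from the two branches.
    subject_mask: (H, W) binary mask from regional decomposition.
    A box blur feathers the mask edge so the two textures meet cleanly.
    """
    m = subject_mask[None, None].float()                    # (1, 1, H, W)
    m = F.avg_pool2d(m, blur, stride=1, padding=blur // 2)  # soften boundary
    return m[0] * subject_img + (1.0 - m[0]) * surround_img

# Example with random stand-in renders and a square "subject" region.
subj, surr = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
mask = torch.zeros(64, 64)
mask[16:48, 16:48] = 1.0
glyph = composite_dual_branch(subj, surr, mask)             # (3, 64, 64)
```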
Aesthetic Guideline Driven Photography by Robots
Gadde, Raghudeep (International Institute of Information Technology - Hyderabad) | Karlapalem, Kamalakar (International Institute of Information Technology - Hyderabad)
Robots depend on captured images for perceiving the environment. A robot can replace a human in capturing quality photographs for publishing. In this paper, we employ iterative photo capture, in which the robot repositions itself and retakes photographs until it obtains a good-quality image. Our image quality assessment approach combines a few high-level image features with some of the aesthetic guidelines of professional photography. Our system can also be used in web image search applications to rank images. We test our quality assessment approach on a large and diverse dataset, on which our system achieves a classification accuracy of 79%. We assess the aesthetic error in the captured image and estimate the change in robot orientation required to retake an aesthetically better photograph. Our experiments are conducted on a NAO robot with no stereo vision. The results demonstrate that our system can capture photographs in accord with professional human photography.
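One concrete aesthetic guideline such an assessment could encode is the rule of thirds. The sketch below scores a subject centroid against the four rule-of-thirds power points and derives a normalized pan suggestion; it is an illustrative stand-in, since the paper's actual features and error-to-orientation mapping are not given in the abstract.

```python
import numpy as np

def thirds_points(w, h):
    """The four rule-of-thirds 'power points' of a w x h frame."""
    return np.array([(w * i / 3, h * j / 3) for i in (1, 2) for j in (1, 2)])

def rule_of_thirds_error(subject_xy, img_wh):
    """Distance from the subject centroid to the nearest power point,
    normalized by the frame diagonal. Lower is better under this guideline."""
    d = np.linalg.norm(thirds_points(*img_wh) - np.asarray(subject_xy), axis=1)
    return d.min() / np.hypot(*img_wh)

def pan_suggestion(subject_xy, img_wh):
    """Horizontal shift (fraction of frame width) that moves the subject
    toward the nearest power point; a real system would translate this
    into a reposition-and-retake command for the robot."""
    pts = thirds_points(*img_wh)
    nearest = pts[np.linalg.norm(pts - np.asarray(subject_xy), axis=1).argmin()]
    return (nearest[0] - subject_xy[0]) / img_wh[0]

# A dead-center subject in a 640x480 frame scores poorly; pan toward a third-line.
print(rule_of_thirds_error((320, 240), (640, 480)))  # ~0.17
print(pan_suggestion((320, 240), (640, 480)))        # ~-0.17
```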