- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Middle East > Israel (0.05)
- Europe > Poland (0.04)
- (2 more...)
- North America > Canada (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Europe > Belgium > Flanders > Flemish Brabant > Leuven (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.90)
- Information Technology > Artificial Intelligence > Vision > Image Understanding (0.36)
Table 1: Evaluation of the state-of-the-art model.
Table 2: Accuracy on the VQA v2.0 test set.

We thank all the reviewers for their helpful comments. Q1: How does the paper's contribution relate to the current SOTA? SGAE is a rather complicated scene-graph-based method specific to image captioning. The results with the current SOTA + MIA will be stated more clearly in the paper. Q2: How is MIA used on the baseline systems (i.e., how is MIA applied to image captioning)? As for the settings, we have listed them in the supplementary materials.
Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention
Cho, Wonwoong; Zhang, Yanxia; Chen, Yan-Ying; Inouye, David I.
Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. In practice, however, cross-modal conceptual blending for humans is prone to cognitive biases, such as design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter, "IT-Blender," that can automate the blending process to enhance human creativity. Prior work on cross-modal conceptual blending either fails to encode a real image without loss of detail or fails to disentangle the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of detail and blends the visual concept with the object specified by the text in a disentangled way. Our experimental results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on a new application of image generative models to augment human creativity.
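The abstract describes blended attention only at a high level. As a reading aid, below is a minimal, hypothetical PyTorch sketch of how such a layer might combine self-attention over the noisy generated latents with attention into the clean reference image's latents. Every name here (BlendedAttention, to_kv_ref, ref_scale) is an illustrative assumption, not code from the paper.

```python
# Hypothetical sketch of a "blended attention" layer: self-attention over
# the noisy generated-image latents is extended with keys/values computed
# from a clean reference image's latents, so the reference's visual concept
# can blend into the text-specified object. Names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendedAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)      # K/V for the noisy latents
        self.to_kv_ref = nn.Linear(dim, 2 * dim)  # K/V for the reference latents
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor,
                ref_scale: float = 1.0) -> torch.Tensor:
        # x:   (batch, n_tokens, dim)  noisy generated-image latents
        # ref: (batch, m_tokens, dim)  clean reference-image latents
        b, n, d = x.shape
        q = self.to_q(x)
        k, v = self.to_kv(x).chunk(2, dim=-1)
        k_r, v_r = self.to_kv_ref(ref).chunk(2, dim=-1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
            return t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v, k_r, v_r = map(split, (q, k, v, k_r, v_r))

        # Ordinary self-attention over the noisy latents...
        out = F.scaled_dot_product_attention(q, k, v)
        # ...plus a separately projected, scaled attention branch into the
        # reference latents, keeping the two input streams disentangled.
        out = out + ref_scale * F.scaled_dot_product_attention(q, k_r, v_r)

        out = out.transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)

if __name__ == "__main__":
    attn = BlendedAttention(dim=64, num_heads=8)
    x = torch.randn(2, 16, 64)    # noisy latents: batch 2, 16 tokens
    ref = torch.randn(2, 77, 64)  # reference latents: 77 tokens
    print(attn(x, ref, ref_scale=0.7).shape)  # -> torch.Size([2, 16, 64])
```

The additive, scaled reference branch is one common way (as in decoupled cross-attention adapters) to keep a reference image's influence separable from the text-driven generation, which matches the abstract's disentanglement claim in spirit; IT-Blender's actual mechanism may differ.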