Response to Reviewer 5
We appreciate the suggestions from R6, R7, and R8 and will include them in the paper. We have included the most competitive methods with settings comparable to ours at submission time. We will include Algorithm 1 as shown. Regarding whether S is sampled such that all attributes are present: our framework can compose features from any set S by solving Eq. (10), even with missing attributes in S. Please note that these are different.
Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Talon, Davide, Girella, Federico, Liu, Ziyue, Cristani, Marco, Wang, Yiming
Natural language goes beyond dryly describing visual content. It contains rich abstract concepts that express feeling, creativity, and properties that cannot be directly perceived. Yet, current research in Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and underestimated value through extensive analysis. In particular, we focus our investigation on the fashion domain, a highly representative field rich in abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling concrete ones, providing novel information, and proving useful in the retrieval task. However, a critical challenge emerges: current general-purpose and fashion-specific VLMs are pre-trained on databases whose text corpora lack sufficient abstract words, hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms fine-tuned VLMs in both same- and cross-dataset settings, demonstrating its effectiveness and strong generalization capability. Moreover, the improvement introduced by ACT is consistent across various VLMs, making it a plug-and-play solution.
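The core idea of shifting an abstract text embedding toward well-represented concrete ones in the latent space can be illustrated with a minimal sketch. This is not the paper's actual ACT algorithm: the function name, the use of a precomputed bank of concrete-term embeddings, and the softmax-weighted nearest-neighbour combination are all illustrative assumptions.

```python
import numpy as np

def translate_abstract_to_concrete(query_emb, concrete_bank, k=5):
    """Shift an abstract text embedding toward nearby concrete-term
    embeddings via a softmax-weighted average of its k nearest
    neighbours (hypothetical simplification, not the paper's method)."""
    # Cosine similarity between the query and every concrete embedding.
    q = query_emb / np.linalg.norm(query_emb)
    bank = concrete_bank / np.linalg.norm(concrete_bank, axis=1, keepdims=True)
    sims = bank @ q
    # Keep only the k most similar concrete terms.
    top = np.argsort(sims)[-k:]
    weights = np.exp(sims[top])
    weights /= weights.sum()
    # Weighted average of the selected concrete embeddings, re-normalized.
    translated = weights @ bank[top]
    return translated / np.linalg.norm(translated)

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 16))   # stand-in concrete-term embeddings
query = rng.normal(size=16)         # stand-in abstract-term embedding
out = translate_abstract_to_concrete(query, bank)
print(out.shape)
```

Because the translated embedding lies in the span of well-represented concrete embeddings, it can be matched against image embeddings with the VLM's usual similarity scoring, without any fine-tuning.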
Review for NeurIPS paper: Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency
Weaknesses: - It is unclear from Figure 5 whether there is a significant improvement over RSTG [33]. In particular, the results are only compared from the frontal view; the approach should be compared with [33], which shows multiple views of the image. The results of [33] are not compared on DeepFashion. In fact, CMR looks much worse perceptually than RSTG in Figure 5(a), yet there is a significant difference in mask-SSIM, which is a bit peculiar. For human body shapes, the simple spherical UV mapping introduces quite a significant distortion.