Response to Reviewer 5

Neural Information Processing Systems

We appreciate the suggestions from R6, R7, and R8 and will include them in the paper. At submission time, we had included the most competitive methods with settings comparable to ours, and we will add the shown Algorithm 1. Regarding whether S is sampled such that all attributes are present: our framework can compose features from any set S by solving Eq. (10), even when attributes are missing from S. Please note that the two cases are different.



Seeing the Abstract: Translating the Abstract Language for Vision Language Models

Talon, Davide, Girella, Federico, Liu, Ziyue, Cristani, Marco, Wang, Yiming

arXiv.org Artificial Intelligence

Natural language goes beyond dryly describing visual content. It contains rich abstract concepts that express feeling, creativity, and properties that cannot be directly perceived. Yet current research on Vision Language Models (VLMs) has not shed light on abstract-oriented language. Our research breaks new ground by uncovering its wide presence and underestimated value through extensive analysis. In particular, we focus our investigation on the fashion domain, a highly representative field rich in abstract expressions. By analyzing recent large-scale multimodal fashion datasets, we find that abstract terms have a dominant presence, rivaling concrete ones, providing novel information, and proving useful for retrieval. However, a critical challenge emerges: current general-purpose and fashion-specific VLMs are pre-trained on databases whose text corpora lack sufficient abstract words, hindering their ability to effectively represent abstract-oriented language. We propose a training-free and model-agnostic method, Abstract-to-Concrete Translator (ACT), to shift abstract representations towards well-represented concrete ones in the VLM latent space, using pre-trained models and existing multimodal databases. On the text-to-image retrieval task, despite being training-free, ACT outperforms fine-tuned VLMs in both same- and cross-dataset settings, demonstrating its effectiveness and strong generalization capability. Moreover, the improvement introduced by ACT is consistent across various VLMs, making it a plug-and-play solution.
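The core idea the abstract describes, shifting an abstract query embedding toward nearby well-represented concrete embeddings in a VLM's latent space, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual algorithm: the function name `act_translate`, the k-nearest-neighbor selection, and the softmax weighting are assumptions, and the random vectors stand in for real text-encoder embeddings.

```python
import numpy as np

def act_translate(query_emb, concrete_embs, k=3, temperature=0.1):
    """Shift an abstract query embedding toward its k nearest
    concrete-term embeddings (hypothetical sketch of the idea).

    query_emb:     (d,)  embedding of the abstract query text
    concrete_embs: (n, d) embeddings of concrete vocabulary terms
    """
    # Normalize so dot products are cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    C = concrete_embs / np.linalg.norm(concrete_embs, axis=1, keepdims=True)

    # Find the k concrete terms most similar to the abstract query.
    sims = C @ q
    top = np.argsort(sims)[-k:]

    # Softmax-weighted combination of the nearest concrete embeddings.
    w = np.exp(sims[top] / temperature)
    w /= w.sum()
    translated = w @ C[top]

    # Return a unit vector usable for cosine-based retrieval.
    return translated / np.linalg.norm(translated)
```

Because the translated query lives in the same latent space, it can be dropped into any cosine-similarity text-to-image retrieval pipeline without retraining, which is what makes such an approach model-agnostic.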


Review for NeurIPS paper: Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency

Neural Information Processing Systems

Weaknesses: - It is unclear from Figure 5 whether there is a significant improvement over RSTG [33]. In particular, the results are compared only from the frontal view; the approach should be compared with [33], which shows multiple views of the image. The results of [33] are not compared on DeepFashion. In fact, CMR looks much worse perceptually than RSTG in Figure 5(a), yet there is a significant difference in mask-SSIM, which is a bit peculiar. For human body shapes, the simple spherical UV mapping introduces quite significant distortion.