Easy Regional Contrastive Learning of Expressive Fashion Representations
Neural Information Processing Systems
When learning vision-language models (VLMs) for the fashion domain, most existing works design new architectures from vanilla BERT with additional objectives, or perform dense multi-task learning with fashion-specific tasks. Though progress has been made, their architectures and objectives are often intricate, and their extensibility is limited. By contrast, with a simple architecture (comprising only two unimodal encoders) and just a contrastive objective, popular pre-trained VL models (e.g., CLIP) achieve superior performance in general domains and are easily extended to downstream tasks. However, inheriting such benefits of CLIP in the fashion domain is non-trivial in the presence of the notable domain gap. Empirically, we find that directly finetuning on fashion data leads CLIP to frequently ignore minor yet important details, such as logos and composition, which are critical in fashion tasks such as retrieval and captioning. In this work, to maintain CLIP's simple architecture and objective while explicitly attending to fashion details, we propose E
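For context, the "simple architecture and contrastive objective" the abstract credits to CLIP is the symmetric image-text InfoNCE loss over a batch of aligned pairs. Below is a minimal sketch of that objective, assuming PyTorch; the function name, temperature value, and tensor shapes are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of aligned pairs.

    image_emb, text_emb: (batch, dim) outputs of the two unimodal encoders.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the matched pairs.
    logits = image_emb @ text_emb.t() / temperature

    # Each image's positive text (and vice versa) sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because the loss only contrasts whole-image and whole-caption embeddings, nothing in it forces attention to small regions, which is consistent with the abstract's observation that finetuned CLIP overlooks details like logos.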