Masked Vision-Language Transformer in Fashion
Ge-Peng Ji, Mingcheng Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool
–arXiv.org Artificial Intelligence
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we replace BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design a masked image reconstruction (MIR) objective for fine-grained fashion understanding. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling vision-language alignments. More importantly, MVLT generalizes easily to various matching and generative tasks. Experimental results show clear improvements over the Fashion-Gen 2018 winner Kaleido-BERT on retrieval (rank@5: +17%) and recognition (accuracy: +3%) tasks. Code is available at https://github.com/GewelsJI/MVLT.
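The MIR objective described above follows the familiar masked-autoencoding recipe: split the image into patches, hide a random subset, and penalize reconstruction error only on the hidden patches. The sketch below illustrates that recipe in NumPy; the patch size, mask ratio, and the random stand-in for the decoder output are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def random_mask(n_patches, ratio, rng):
    """Boolean mask: True marks patches hidden from the encoder."""
    n_mask = int(n_patches * ratio)
    idx = rng.permutation(n_patches)[:n_mask]
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx] = True
    return mask

def masked_recon_loss(pred, target, mask):
    """MSE computed only on the masked patches (MAE-style objective)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# Toy example: 32x32 RGB image, 8x8 patches, 60% masking.
img = rng.random((32, 32, 3))
patches = patchify(img, p=8)                 # 16 patches of 192 dims each
mask = random_mask(len(patches), 0.6, rng)
pred = rng.random(patches.shape)             # stand-in for decoder output
loss = masked_recon_loss(pred, patches, mask)
```

In the actual model, `pred` would come from a transformer decoder conditioned on the visible patches and the paired text; restricting the loss to masked patches is what forces the encoder to infer fine-grained appearance (textures, prints) from context.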
Oct-26-2022