Masked Vision-Language Transformer in Fashion

Ge-Peng Ji, Mingcheng Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool

Oct-26-2022–arXiv.org Artificial Intelligence

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that accepts raw multi-modal inputs without extra pre-processing models (e.g., ResNet) and implicitly models the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show clear improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is available at https://github.com/GewelsJI/MVLT.
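The retrieval result above is reported as rank@5. As a rough illustration only, the sketch below assumes the common definition of rank@K: the fraction of queries whose ground-truth item appears among the K highest-scoring candidates. The function name `rank_at_k`, the score layout, and the toy numbers are hypothetical and are not taken from the MVLT code base.

```python
from typing import Sequence


def rank_at_k(scores: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth candidate is among the k highest-scoring ones.

    scores[q][c] is an assumed similarity score between query q and candidate c;
    targets[q] is the index of the ground-truth candidate for query q.
    """
    if len(scores) != len(targets):
        raise ValueError("one target index is required per query")
    hits = 0
    for row, target in zip(scores, targets):
        # Candidate indices sorted from highest to lowest score.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in order[:k]:
            hits += 1
    return hits / len(scores) if scores else 0.0


# Toy example: 3 text queries, 4 candidate images each (scores are made up).
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # ground truth (index 0) is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # ground truth (index 1) is ranked 3rd
    [0.1, 0.2, 0.3, 0.7],  # ground truth (index 0) is ranked 4th
]
toy_targets = [0, 1, 0]

print(rank_at_k(toy_scores, toy_targets, k=1))
print(rank_at_k(toy_scores, toy_targets, k=3))
```

In this toy setup one of three queries is a hit at K=1 and two of three at K=3. The exact candidate-set construction used in the paper's evaluation is not described in the abstract, so it is left out here.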

artificial intelligence, machine learning, natural language

Country:
- North America
  - United States
    - Washington > Clark County
      - Vancouver (0.04)
    - Tennessee > Davidson County
      - Nashville (0.04)
    - New York > New York County
      - New York City (0.04)
    - Nevada > Clark County
      - Las Vegas (0.04)
    - Minnesota > Hennepin County
      - Minneapolis (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Florida > Miami-Dade County
      - Miami (0.04)
  - Canada > Quebec
    - Montreal (0.04)
- Europe
  - France (0.04)
  - Austria > Vienna (0.04)
  - United Kingdom > Scotland
    - City of Glasgow > Glasgow (0.04)
  - Switzerland > Zürich
    - Zürich (0.14)
  - Portugal > Lisbon
    - Lisbon (0.04)
  - Italy > Veneto
    - Venice (0.04)
  - Germany > Bavaria
    - Upper Bavaria > Munich (0.04)
- Asia > China
  - Hong Kong (0.04)
  - Zhejiang Province > Hangzhou (0.04)

Genre:
- Research Report > New Finding (0.34)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks
      - Deep Learning (0.48)
