MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Chu, Xiangxiang, Qiao, Limeng, Zhang, Xinyu, Xu, Shuang, Wei, Fei, Yang, Yang, Sun, Xiaofei, Hu, Yiming, Lin, Xinyang, Zhang, Bo, Shen, Chunhua

Feb-6-2024–arXiv.org Artificial Intelligence

We introduce MobileVLM V2, a family of vision language models that significantly improves upon MobileVLM. It demonstrates that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich, high-quality dataset curation can substantially benefit VLM performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM.

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.94)
  - Machine Learning > Neural Networks
    - Deep Learning (0.94)