Jina-VLM: Small Multilingual Vision Language Model

Koukounas, Andreas, Mastrapas, Georgios, Hönicke, Florian, Eslami, Sedigheh, Roncari, Guillaume, Martens, Scott, Xiao, Han

Dec-5-2025–arXiv.org Artificial Intelligence

We present jina-vlm, a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. It achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm.

Vision-language models (VLMs) combine pretrained vision encoders with large language models to tackle tasks requiring joint visual and textual understanding (Alayrac et al., 2022; Liu et al., 2023). Recent VLMs have achieved strong results on visual question answering (VQA), OCR, and multimodal reasoning.
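The abstract names an attention-pooling connector but does not describe its internals. As a rough illustration of the general idea only, the sketch below pools each non-overlapping 2x2 neighbourhood of patch tokens into a single token using single-head attention, so an 8x8 grid of 64 tokens becomes 16. The window size, the mean-of-window query, the single head, the dimensions, and the random weights are all assumptions made for this example; they are not taken from the paper or the released code.

```python
import numpy as np

def attention_pool(patches, w_q, w_k, w_v, window=2):
    """Pool an (H, W, D) grid of patch tokens to (H//window, W//window, D_out).

    Hypothetical helper for illustration; not the jina-vlm implementation.
    """
    h, w, d = patches.shape
    assert h % window == 0 and w % window == 0, "grid must tile evenly"
    # Group the grid into non-overlapping window x window neighbourhoods.
    blocks = patches.reshape(h // window, window, w // window, window, d)
    blocks = blocks.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, d)
    # One query per neighbourhood (assumed: mean of its tokens);
    # keys and values are the neighbourhood's own tokens.
    q = blocks.mean(axis=1, keepdims=True) @ w_q   # (B, 1, Dk)
    k = blocks @ w_k                               # (B, n, Dk)
    v = blocks @ w_v                               # (B, n, D_out)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])  # (B, 1, n)
    scores = scores - scores.max(axis=-1, keepdims=True)      # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    pooled = (weights @ v).squeeze(1)              # (B, D_out)
    return pooled.reshape(h // window, w // window, -1)

# Usage: an 8x8 grid of 16-dimensional patch tokens with random weights.
rng = np.random.default_rng(0)
grid = rng.standard_normal((8, 8, 16))
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))
pooled = attention_pool(grid, w_q, w_k, w_v)
print(grid.shape[0] * grid.shape[1], "->", pooled.shape[0] * pooled.shape[1])
```

Under these assumptions the language model receives a quarter as many visual tokens per image tile, which is the kind of saving the phrase "token-efficient" suggests; the actual reduction factor used by jina-vlm should be checked against the paper.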

Genre:
- Research Report > New Finding (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.36)
