AITopics | vision expert

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Neural Information Processing SystemsMar-22-2026, 06:18:24 GMT

As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts.

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.41)

Add feedback

20ffc2b42c7de4a1960cfdadf305bbe2-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing SystemsFeb-19-2026, 04:30:58 GMT

arxiv preprint arxiv, dataset, information, (13 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report (0.46)

Industry: Social Sector (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

d46662aa53e78a62afd980a29e0c37ed-Paper-Conference.pdf

Neural Information Processing SystemsFeb-12-2026, 03:19:10 GMT

encoder, image-text pair, representation, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > British Columbia > Vancouver (0.04)
(13 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

MoV A: Adapting Mixture of Vision Experts to Multimodal Context

Neural Information Processing SystemsNov-20-2025, 02:58:22 GMT

We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoV A can achieve significant performance gains over current state-of-the-art methods in a wide range of challenging multimodal benchmarks.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

bb0fea29f7aa6ede17e906ac6a225f34-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 14:52:23 GMT

arxiv preprint arxiv, language model, vision expert, (14 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre:

Research Report > Experimental Study (0.93)
Workflow (0.67)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Add feedback

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

Neural Information Processing SystemsOct-9-2025, 20:45:46 GMT

As the saying goes, "an image is worth a thousand words".

arxiv preprint arxiv, dataset, information, (13 more...)

Neural Information Processing Systems

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > China > Beijing > Beijing (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre: Research Report (0.46)

Industry: Social Sector (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts

Wang, Yimu, Azadani, Mozhgan Nasr, Sedwards, Sean, Czarnecki, Krzysztof

arXiv.org Artificial IntelligenceSep-23-2025

Redundancy of visual tokens in multi-modal large language models (MLLMs) significantly reduces their computational efficiency. Recent approaches, such as resamplers and summarizers, have sought to reduce the number of visual tokens, but at the cost of visual reasoning ability. To address this, we propose LEO-MINI, a novel MLLM that significantly reduces the number of visual tokens and simultaneously boosts visual reasoning capabilities. For efficiency, LEO-MINI incorporates CoTR, a novel token reduction module to consolidate a large number of visual tokens into a smaller set of tokens, using the similarity between visual tokens, text tokens, and a compact learnable query. For effectiveness, to scale up the model's ability with minimal computational overhead, LEO-MINI employs MMoE, a novel mixture of multi-modal experts module. MMOE employs a set of LoRA experts with a novel router to switch between them based on the input text and visual tokens instead of only using the input hidden state. MMoE also includes a general LoRA expert that is always activated to learn general knowledge for LLM reasoning. For extracting richer visual features, MMOE employs a set of vision experts trained on diverse domain-specific data. To demonstrate LEO-MINI's improved efficiency and performance, we evaluate it against existing efficient MLLMs on various benchmark vision-language tasks.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2504.04653

Country: North America > Canada (0.28)

Genre: Research Report (0.50)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Neural Information Processing SystemsAug-19-2025, 05:39:03 GMT

Specifically, we introduce Multiway Transformer, where each block contains a pool of modality-specific experts and a shared self-attention layer.

image-text pair, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > Canada > British Columbia > Vancouver (0.04)
(13 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.93)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

Neural Information Processing SystemsMay-27-2025, 14:31:53 GMT

As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders in CLIP and DINOv2 have brought promising performance, we found that there is still no single vision encoder that can dominate various image content understanding, e.g., the CLIP vision encoder leads to outstanding results on general image understanding but poor performance on document or chart content. To alleviate the bias of CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose the MoVA, a powerful and novel MLLM, adaptively routing and fusing task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy to dynamically select the most suitable vision experts according to the user instruction, input image, and expertise of vision experts. In the fine-grained stage, we elaborately conduct the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from various experts.

multimodal context, vision encoder, vision expert, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.44)

Add feedback

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Azadani, Mozhgan Nasr, Riddell, James, Sedwards, Sean, Czarnecki, Krzysztof

arXiv.org Artificial IntelligenceJan-12-2025

Enhanced visual understanding serves as a cornerstone for multimodal large language models (MLLMs). Recent hybrid MLLMs incorporate a mixture of vision experts to address the limitations of using a single vision encoder and excessively long visual tokens. Despite the progress of these MLLMs, a research gap remains in effectively integrating diverse vision encoders. This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO, a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling: for each segmented tile of the input images, LEO sequentially interleaves the visual tokens from its two vision encoders. Extensive evaluation across 13 vision-language benchmarks reveals that LEO outperforms state-of-the-art open-source MLLMs and hybrid MLLMs on the majority of tasks. Furthermore, we show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe, achieving competitive performance compared to existing baselines. The code and model will be publicly available.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2501.06986

Genre: Research Report (1.00)

Industry: Transportation > Ground > Road (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Filters

Collaborating Authors

vision expert

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

20ffc2b42c7de4a1960cfdadf305bbe2-Paper-Datasets_and_Benchmarks_Track.pdf

d46662aa53e78a62afd980a29e0c37ed-Paper-Conference.pdf

MoV A: Adapting Mixture of Vision Experts to Multimodal Context

bb0fea29f7aa6ede17e906ac6a225f34-Paper-Conference.pdf

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

LEO-MINI: An Efficient Multimodal Large Language Model using Conditional Token Reduction and Mixture of Multi-Modal Experts

Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models