ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Tang, Xin, Han, Youfang, Gou, Fangfei, Zhao, Wei, Meng, Xin, Yu, Yang, Zhang, Jinguo, Shi, Yuanchun, Wang, Yuntao, Zhang, Tengxiang
–arXiv.org Artificial Intelligence
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices can handle simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring a drop of less than 10% in problem-solving probability.
Vision-Language Models (VLMs), which integrate visual and textual understanding, have become crucial components in a wide range of AI applications, from robotics control to user interface navigation (Zhang et al., 2024; Shinde et al., 2025; Li et al., 2024). Moreover, a one-size-fits-all deployment strategy is suboptimal, as users increasingly expect systems that not only deliver high-quality responses but also adapt to diverse real-world scenarios with varying demands for latency, cost, and privacy. To effectively integrate the strengths of both large VLMs (LVLMs) and small VLMs (SVLMs), edge-cloud collaborative routing (Yuan et al., 2025; Hao et al., 2024) is a natural fit. At its core is a lightweight model router (Ding et al., 2024; Ong et al., 2024) that inspects each query and selects an appropriate VLM.
However, a general router is insufficient. Routing must be scenario-aware: the desired behavior varies across application contexts and can be configured by users or automatically inferred by scenario detection algorithms (Fifty et al., 2023; Someki et al., 2025). Existing routers are often text-centric and optimize a fixed trade-off between cost and quality, failing to adapt to multimodal, scenario-aware user needs. For example, real-time game interaction prioritizes low latency, medical diagnostics emphasizes answer quality, and mobile assistants require low energy use and strong privacy (Asgari et al., 2025).
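The text above does not give the router's actual scoring rule, so the following is only a minimal sketch of what scenario-aware selection could look like, not ECVL-ROUTER's method. It assumes a router has already predicted a per-model probability of solving the query, and it scores each model with a weighted sum of quality, latency, and energy. The scenario names mirror the three categories in the abstract; the weights, the `ModelProfile` fields, the latency and energy figures, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical deployment profile of one candidate VLM."""
    name: str
    latency_s: float  # assumed average response latency, seconds
    energy_j: float   # assumed energy per query, joules


# Hypothetical weights per scenario; not taken from the paper.
SCENARIO_WEIGHTS = {
    "fast_response": {"quality": 0.2, "latency": 0.7, "energy": 0.1},
    "high_quality": {"quality": 0.8, "latency": 0.1, "energy": 0.1},
    "low_energy": {"quality": 0.2, "latency": 0.1, "energy": 0.7},
}


def utility(p_solve, model, scenario, max_latency_s, max_energy_j):
    """Weighted utility: reward predicted quality, penalize normalized cost."""
    w = SCENARIO_WEIGHTS[scenario]
    return (
        w["quality"] * p_solve
        - w["latency"] * model.latency_s / max_latency_s
        - w["energy"] * model.energy_j / max_energy_j
    )


def route(p_solve_by_model, models, scenario):
    """Return the model with the highest utility for this query and scenario."""
    max_latency_s = max(m.latency_s for m in models)
    max_energy_j = max(m.energy_j for m in models)
    return max(
        models,
        key=lambda m: utility(
            p_solve_by_model[m.name], m, scenario, max_latency_s, max_energy_j
        ),
    )


# Illustrative profiles and router predictions (made-up numbers).
SMALL = ModelProfile("small_edge", latency_s=0.5, energy_j=2.0)
LARGE = ModelProfile("large_cloud", latency_s=3.0, energy_j=20.0)
MODELS = [SMALL, LARGE]

hard_query = {"small_edge": 0.60, "large_cloud": 0.90}
easy_query = {"small_edge": 0.88, "large_cloud": 0.90}

for scenario in SCENARIO_WEIGHTS:
    print(scenario, "->", route(hard_query, MODELS, scenario).name)
```

Under these assumed numbers, the hard query goes to the large model only in the quality-first scenario, while the easy query stays on the small model even there, because the small quality gain does not offset the extra latency and energy. A learned router would replace the hand-set probabilities with predictions from the query's image and text.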
Nov-3-2025
- Genre:
  - Research Report > New Finding (0.34)
- Industry:
  - Energy (0.68)
  - Health & Medicine (0.48)
  - Information Technology (0.46)
- Technology:
  - Information Technology
    - Communications > Networks (1.00)
    - Artificial Intelligence
      - Vision (1.00)
      - Natural Language
        - Large Language Model (1.00)
        - Chatbot (0.94)
      - Machine Learning > Neural Networks
        - Deep Learning (0.68)