Efficient Test-Time Scaling for Small Vision-Language Models
Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.
arXiv.org Artificial Intelligence
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates their outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels derived from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both across model scales and across different VLMs without additional tuning.
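The token-level aggregation in TTAug can be illustrated with a minimal sketch: run the model on several augmented versions of the same input, then take a majority vote at each token position. The function name, the fixed-length token sequences, and the plain majority vote are assumptions for illustration, not the paper's exact procedure.

```python
from collections import Counter

def ttaug_aggregate(outputs):
    """Token-level aggregation over augmented runs (illustrative sketch).

    `outputs` is a list of token sequences, one per augmented input.
    At each position we keep the majority token; equal sequence
    lengths are assumed here for simplicity.
    """
    aggregated = []
    for position_tokens in zip(*outputs):
        # Counter.most_common(1) returns the (token, count) pair
        # with the highest count at this position.
        token, _count = Counter(position_tokens).most_common(1)[0]
        aggregated.append(token)
    return aggregated

# Example: three augmented passes over the same image-question pair.
runs = [
    ["a", "red", "car"],
    ["a", "red", "bus"],
    ["a", "red", "car"],
]
print(ttaug_aggregate(runs))  # → ['a', 'red', 'car']
```

A consensus answer produced this way could also serve as the pseudolabel that TTAdapt uses to update parameters at inference time.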
Oct-7-2025