Efficient Test-Time Scaling for Small Vision-Language Models

Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.

Oct-7-2025–arXiv.org Artificial Intelligence

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Oct-7-2025

arXiv.org PDF

Add feedback

Country:
- Asia (0.92)
- Europe > Austria (0.27)

Genre:
- Research Report > New Finding (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Text Processing (0.67)
      - Chatbot (0.67)
    - Machine Learning > Neural Networks
      - Deep Learning (0.92)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found