VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Neural Information Processing Systems
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, and visual token sequences are often significantly longer than the accompanying text. However, we observe that most real-world scenarios do not require so many visual tokens. Although performance drops significantly on a small subset of OCR-related tasks, models still perform accurately on most other general VQA tasks at only 1/4 resolution. We therefore propose to process different samples at different resolutions, and present a new paradigm for visual token reduction, VisionThink. It starts with a downsampled image and decides whether that image is sufficient for solving the problem.
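The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the `model` callable, and the fallback to the full-resolution image are all assumptions made for illustration, and "1/4 resolution" is read here as halving each side of the image. The image is represented only by its size, because the sketch tracks token cost rather than pixels.

```python
from typing import Callable, Optional, Tuple

Size = Tuple[int, int]  # (width, height) in pixels

PATCH = 28  # hypothetical: each visual token covers a PATCH x PATCH tile


def visual_tokens(size: Size, patch: int = PATCH) -> int:
    """Approximate visual token count: one token per patch-sized tile."""
    width, height = size
    cols = -(-width // patch)   # ceiling division
    rows = -(-height // patch)
    return cols * rows


def answer_with_dynamic_resolution(
    size: Size,
    model: Callable[[Size], Optional[str]],
    scale: int = 2,
) -> Tuple[Optional[str], int]:
    """Try the downsampled image first; fall back to full resolution.

    `model` is a hypothetical stand-in: it returns an answer string, or
    None when it judges the image it was given to be insufficient.
    Returns the answer and the total number of visual tokens consumed.
    """
    width, height = size
    low = (max(1, width // scale), max(1, height // scale))
    spent = visual_tokens(low)
    reply = model(low)
    if reply is not None:
        return reply, spent  # the low-resolution image was enough
    # Assumed fallback: re-run on the original image and pay for both passes.
    return model(size), spent + visual_tokens(size)


def easy_model(size: Size) -> Optional[str]:
    """Toy model that can always answer, e.g. a general VQA question."""
    return "a cat"


def ocr_model(size: Size) -> Optional[str]:
    """Toy model that needs fine detail, e.g. an OCR-style question."""
    return "INVOICE 42" if size[0] >= 1000 else None


easy_result = answer_with_dynamic_resolution((1120, 1120), easy_model)
ocr_result = answer_with_dynamic_resolution((1120, 1120), ocr_model)
```

Under these assumptions, a sample that is answerable from the downsampled image costs a quarter of the full-resolution visual tokens, while a sample that needs the original image costs more than a single full-resolution pass. The approach therefore pays off when most samples fall into the first group, which is the observation the abstract reports.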
Jun-19-2026, 11:02:27 GMT
- Genre:
  - Research Report
    - New Finding (1.00)
    - Experimental Study (1.00)
- Industry:
  - Information Technology (0.45)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Representation & Reasoning (1.00)
    - Cognitive Science (1.00)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (0.93)
    - Machine Learning
      - Reinforcement Learning (1.00)
      - Neural Networks > Deep Learning (0.93)