FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

Jun-19-2026, 13:01:02 GMT–Neural Information Processing Systems

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region.
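
The third step (proposing and ranking regions from a relevance map) can be illustrated with a minimal sketch. The abstract does not say how FOCUS proposes or scores regions, so everything below is an assumption made for illustration: the relevance map is a hand-written toy grid rather than one derived from a KV cache, and the names `propose_regions`, `mean_relevance`, `top_ranked_region`, `window`, and `num_anchors` are hypothetical. The sketch anchors fixed-size windows on the most relevant cells and ranks them by mean relevance.

```python
from typing import List, Tuple

# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def mean_relevance(rel_map: List[List[float]], box: Box) -> float:
    """Average relevance of all cells inside the box."""
    top, left, bottom, right = box
    cells = [rel_map[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(cells) / len(cells)


def propose_regions(rel_map: List[List[float]], window: int, num_anchors: int) -> List[Box]:
    """Propose square windows anchored on the highest-relevance cells.

    Assumption: this anchoring rule is illustrative only, not the paper's.
    """
    rows, cols = len(rel_map), len(rel_map[0])
    window = min(window, rows, cols)  # keep the window inside the map
    # Sort cells from most to least relevant.
    ranked_cells = sorted(
        ((rel_map[r][c], r, c) for r in range(rows) for c in range(cols)),
        key=lambda item: item[0],
        reverse=True,
    )
    boxes: List[Box] = []
    for _, r, c in ranked_cells[:num_anchors]:
        # Place the window around the anchor, clamped to the map borders.
        top = min(max(r - window // 2, 0), rows - window)
        left = min(max(c - window // 2, 0), cols - window)
        box = (top, left, top + window, left + window)
        if box not in boxes:  # skip duplicate proposals
            boxes.append(box)
    return boxes


def top_ranked_region(rel_map: List[List[float]], window: int = 2, num_anchors: int = 3) -> Box:
    """Return the proposal with the highest mean relevance."""
    boxes = propose_regions(rel_map, window, num_anchors)
    return max(boxes, key=lambda b: mean_relevance(rel_map, b))


if __name__ == "__main__":
    # Toy 4x4 relevance map with a hot spot in the lower-right area.
    toy_map = [
        [0.0, 0.1, 0.0, 0.0],
        [0.1, 0.2, 0.0, 0.0],
        [0.0, 0.0, 0.8, 0.9],
        [0.0, 0.0, 0.7, 0.6],
    ]
    print(top_ranked_region(toy_map))
```

In this sketch the search is guided by the map rather than exhaustive: only `num_anchors` windows are scored, which mirrors the efficiency argument in the abstract without claiming to reproduce the actual procedure.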

llava-1, natural language, question answering

Country:
- North America > United States (1.00)
- Europe (1.00)

Genre:
- Research Report > Experimental Study (1.00)
- Overview (0.92)

Industry:
- Transportation (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (0.87)
  - Natural Language > Question Answering (0.70)
  - Cognitive Science > Problem Solving (0.68)
