Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination," where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead.

Vision-Language Models (VLMs) have advanced by synergizing visual representations with linguistic reasoning. Empirical analyses from AgroBench reveal a critical failure mode: "Reasoning-Driven Hallucination" (Shinoda et al., 2025).
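The abstract above is the only specification available here, so as a rough illustration of how the three decoupled stages could fit together, the Python sketch below stubs out the frozen VLM and the 1.7B router with hard-coded strings and uses plain token overlap as a stand-in alignment score. Every function name and heuristic in it is a hypothetical assumption, not the authors' released code.

```python
from dataclasses import dataclass

# Hypothetical knowledge the "Recite" stage would elicit from the frozen
# backbone; hard-coded here purely so the sketch runs end to end.
_STUB_KNOWLEDGE = {
    "redroot pigweed": "erect stem often purple-tinged with serrated leaves",
    "common lambsquarters": "broad diamond-shaped leaves with a mealy white coating",
}

@dataclass
class LookOutput:
    description: str       # objective visual description of the image
    candidates: list[str]  # candidate label set

def look(image) -> LookOutput:
    # Stage 1 (Look): the frozen VLM emits an objective description plus
    # a candidate set. Stubbed: a real system would run the VLM here.
    return LookOutput(
        description="narrow serrated leaves and a purple-tinged stem",
        candidates=list(_STUB_KNOWLEDGE),
    )

def recite(candidate: str, description: str) -> str:
    # Stage 2 (Recite): the lightweight router turns the visual cues into
    # a targeted query that triggers candidate-specific parametric
    # knowledge in the backbone. Stubbed with a lookup table.
    return _STUB_KNOWLEDGE[candidate]

def answer(out: LookOutput) -> str:
    # Stage 3 (Answer): align the description against each recitation in
    # parallel and return the most consistent label. Plain token overlap
    # stands in for whatever alignment score the paper actually uses.
    desc_tokens = set(out.description.lower().split())
    def consistency(candidate: str) -> int:
        recitation = recite(candidate, out.description)
        return len(desc_tokens & set(recitation.lower().split()))
    return max(out.candidates, key=consistency)

if __name__ == "__main__":
    print(answer(look(image=None)))  # -> "redroot pigweed"
```

Keeping the three stages as separate functions mirrors the paper's premise that the backbone stays frozen: only the routing of self-generated hints between stages changes, never the model weights.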
arXiv.org Artificial Intelligence
Dec-4-2025
- Genre:
- Research Report (0.64)
- Industry:
- Food & Agriculture > Agriculture (1.00)
- Health & Medicine (0.68)
- Technology: