Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to "Reasoning-Driven Hallucination," where linguistic priors override visual perception. A key bottleneck is the "Modality Gap": visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose "Look, Recite, Then Answer," a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead.

Vision-Language Models (VLMs) have advanced by synergizing visual representations with linguistic reasoning. Empirical analyses from AgroBench reveal a critical failure mode: "Reasoning-Driven Hallucination" (Shinoda et al., 2025).
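The abstract above is the only specification available here, so as a rough illustration of how the three decoupled stages could fit together, the Python sketch below stubs out the frozen VLM and the 1.7B router with hard-coded strings and uses plain token overlap as a stand-in alignment score. Every function name and heuristic in it is a hypothetical assumption, not the authors' released code.

```python
from dataclasses import dataclass

# Hypothetical knowledge the "Recite" stage would elicit from the frozen
# backbone; hard-coded here purely so the sketch runs end to end.
_STUB_KNOWLEDGE = {
    "redroot pigweed": "erect stem often purple-tinged with serrated leaves",
    "common lambsquarters": "broad diamond-shaped leaves with a mealy white coating",
}

@dataclass
class LookOutput:
    description: str       # objective visual description of the image
    candidates: list[str]  # candidate label set

def look(image) -> LookOutput:
    # Stage 1 (Look): the frozen VLM emits an objective description plus
    # a candidate set. Stubbed: a real system would run the VLM here.
    return LookOutput(
        description="narrow serrated leaves and a purple-tinged stem",
        candidates=list(_STUB_KNOWLEDGE),
    )

def recite(candidate: str, description: str) -> str:
    # Stage 2 (Recite): the lightweight router turns the visual cues into
    # a targeted query that triggers candidate-specific parametric
    # knowledge in the backbone. Stubbed with a lookup table.
    return _STUB_KNOWLEDGE[candidate]

def answer(out: LookOutput) -> str:
    # Stage 3 (Answer): align the description against each recitation in
    # parallel and return the most consistent label. Plain token overlap
    # stands in for whatever alignment score the paper actually uses.
    desc_tokens = set(out.description.lower().split())
    def consistency(candidate: str) -> int:
        recitation = recite(candidate, out.description)
        return len(desc_tokens & set(recitation.lower().split()))
    return max(out.candidates, key=consistency)

if __name__ == "__main__":
    print(answer(look(image=None)))  # -> "redroot pigweed"
```

Keeping the three stages as separate functions mirrors the paper's premise that the backbone stays frozen: only the routing of self-generated hints between stages changes, never the model weights.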
arXiv.org Artificial Intelligence
Dec-4-2025
- Genre:
- Research Report (0.64)
- Industry:
- Food & Agriculture > Agriculture (1.00)
- Health & Medicine (0.68)
- Technology: