Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models