See then Tell: Enhancing Key Information Extraction with Vision Grounding