Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
