VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models
