Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
