Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs