Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models