What is needed for simple spatial language capabilities in VQA?