What is needed for simple spatial language capabilities in VQA?
