What's "up" with vision-language models? Investigating their struggle with spatial reasoning