Visual cognition in multimodal large language models