Where do Large Vision-Language Models Look at when Answering Questions?

Open in new window