Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?

Open in new window