Can Large Vision-Language Models Understand Multimodal Sarcasm?

Open in new window