Can Vision Language Models Understand Mimed Actions?

Open in new window