Grounding Multimodal Large Language Models in Actions
