Grounding Multimodal Large Language Models in Actions