Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks