Learning Multimodal VAEs through Mutual Supervision