A Vision-free Baseline for Multimodal Grammar Induction