Grammar Induction from Visual, Speech and Text