Reframing linguistic bootstrapping as joint inference using visually-grounded grammar induction models