A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
