Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Open in new window