Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework