Video sentence grounding with temporally global textual knowledge