Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
