Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning