Pre-trained language models for music captioning and query response