Video Referring Expression Comprehension via Transformer with Content-conditioned Query