Transformer-based encoder-encoder architecture for Spoken Term Detection