Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters