SPES: Spectrogram Perturbation for Explainable Speech-to-Text Generation