Seeing isn't Hearing: Benchmarking Vision Language Models at Interpreting Spectrograms