A Textless Metric for Speech-to-Speech Comparison