Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison