When End-to-End Is Overkill: Rethinking Cascaded Speech-to-Text Translation