Distilling an End-to-End Voice Assistant Without Instruction Training Data