Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking