Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments