TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling