Toward Joint Language Modeling for Speech Units and Text