Joint Audio and Speech Understanding