Multimodal Contextualized Semantic Parsing from Speech