Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation