Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition