Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition