A Simple Baseline for Audio-Visual Scene-Aware Dialog