Multi-Scale Attention for Audio Question Answering