Spatio-Temporal Attention Pooling for Audio Scene Classification