Hierarchical Representations for Spatio-Temporal Visual Attention Modeling and Understanding