Modeling Human Visual Motion Processing with Trainable Motion Energy Sensing and a Self-attention Network