What can human minimal videos tell us about dynamic recognition models?