Infer Human's Intentions Before Following Natural Language Instructions