Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring