Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller