A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution