Instruction-Following Agents with Multimodal Transformer