Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents Zihao Wang

Neural Information Processing Systems 

These additional behavior tokens will be augmented to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc .

Similar Docs  Excel Report  more

TitleSimilaritySource
None found