Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
Zihao Wang
Neural Information Processing Systems
These additional behavior tokens are added to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc.
Nov-19-2025, 19:43:46 GMT
- Country:
- Asia > China
- North America > United States
- California > Los Angeles County > Los Angeles (0.14)
- Genre:
- Research Report > Experimental Study (0.93)
- Industry:
- Leisure & Entertainment
- Games > Computer Games (0.51)
- Sports > Golf (0.68)
- Materials > Metals & Mining
- Diamonds (0.46)
- Technology: