Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Zihao Wang

Feb-16-2026, 08:38:05 GMT–Neural Information Processing Systems

These additional behavior tokens will be added to the vocabulary of pretrained Multimodal Language Models. With this encoder, we then pack long-term multimodal interactions involving task instructions, memories, thoughts, observations, textual responses, behavior trajectories, etc.

large language model, machine learning, natural language

Country:
- North America > United States
  - California > Los Angeles County > Los Angeles (0.14)
- Asia > China
  - Beijing > Beijing (0.04)

Genre:
- Research Report > Experimental Study (0.93)

Industry:
- Materials > Metals & Mining
  - Diamonds (0.46)
- Leisure & Entertainment > Games
  - Computer Games (0.73)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Agents (0.93)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.68)
