Humanoid Locomotion as Next Token Prediction
Neural Information Processing Systems
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor sequences. To account for the multimodal nature of the data, we perform prediction in a modality-aligned way, and for each input token predict the next token from the same modality. This general formulation enables us to leverage data with missing modalities, such as videos without actions. We train our model on a dataset of sequences from a prior neural network policy, a model-based controller, motion capture, and YouTube videos of humans. We show that our model enables a real humanoid robot to walk in San Francisco zero-shot. Our model can transfer to the real world even when trained on only 27 hours of walking data, and can generalize to commands not seen during training. These findings suggest a promising path toward learning challenging real-world control tasks by generative modeling of sensorimotor sequences.
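The modality-aligned prediction scheme described above is straightforward to sketch. Below is a minimal PyTorch illustration, not the authors' implementation: it assumes continuous observation/action tokens trained with an MSE next-token loss, and all module names, dimensions, and the `act_valid` mask are illustrative. Observations and actions are interleaved along time, a causal transformer processes the sequence, and the output at each slot predicts the next token of the same modality; masking the action loss is one way to consume action-free data such as trajectories derived from video.

```python
import torch
import torch.nn as nn

class ModalityAlignedTransformer(nn.Module):
    """Causal transformer where each input token predicts the next
    token of the *same* modality (observation or action)."""

    def __init__(self, obs_dim=64, act_dim=32, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.obs_in = nn.Linear(obs_dim, d_model)
        self.act_in = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.obs_out = nn.Linear(d_model, obs_dim)   # obs slot -> next obs
        self.act_out = nn.Linear(d_model, act_dim)   # act slot -> next act

    def forward(self, obs, act):
        # Interleave (o_1, a_1, o_2, a_2, ...) along the time axis.
        B, T, _ = obs.shape
        tokens = torch.stack([self.obs_in(obs), self.act_in(act)], dim=2)
        tokens = tokens.reshape(B, 2 * T, -1)
        # Standard causal mask: -inf above the diagonal blocks attention
        # to future tokens.
        mask = torch.full((2 * T, 2 * T), float("-inf"), device=obs.device).triu(1)
        h = self.backbone(tokens, mask=mask).reshape(B, T, 2, -1)
        # Modality-aligned heads: read obs predictions off obs slots and
        # action predictions off action slots.
        return self.obs_out(h[:, :, 0]), self.act_out(h[:, :, 1])

def next_token_loss(model, obs, act, act_valid):
    """MSE next-token loss. `act_valid` is a per-sequence {0,1} mask that
    zeroes the action loss for data with missing actions (e.g., videos)."""
    pred_obs, pred_act = model(obs[:, :-1], act[:, :-1])
    obs_loss = ((pred_obs - obs[:, 1:]) ** 2).mean()
    act_err = ((pred_act - act[:, 1:]) ** 2).mean(dim=(1, 2))
    act_loss = (act_err * act_valid).sum() / act_valid.sum().clamp(min=1)
    return obs_loss + act_loss
```

Under these assumptions, a video-only batch simply carries `act_valid = 0`, so its gradient comes entirely from observation prediction, which is what lets a single objective absorb policy rollouts, motion capture, and YouTube data alike.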