skill predictor
Appendix A Broader Societal Impact
Our intention is for this algorithm to be used in a real-world setting where humans can provide natural language instructions to robots that can carry them out. For the LOReL baseline, we used the numbers from the paper for LOReL Images. Unless otherwise specified, we use the following settings.
D.4 MI Calculation
We calculate mutual information between the language instructions (L) and the skill codes (z) by writing $\mathrm{MI}(L, z) = H(L) - H(L \mid z)$.
Rephrasal Type        Flat     LISA
seen                  15       40
unseen noun           13.33    33.33
unseen verb           28.33    30
unseen noun + verb    6.7      20
human                 26.98    27.35
This means we are using a manual (human) skill predictor as opposed to our trained skill predictor.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
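The mutual information identity quoted in the snippet above, MI(L, z) = H(L) - H(L|z), can be estimated from co-occurrence counts. The following is a minimal sketch, not the authors' implementation: it assumes each sample is a hypothetical pair `(l, z)` of a discrete language symbol (for example a token or an instruction id) and a discrete skill code, and it uses plain empirical frequencies with entropy measured in bits. The function names `entropy` and `mutual_information` are illustrative.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mutual_information(pairs):
    """Estimate MI(L, z) = H(L) - H(L|z) from (l, z) samples.

    Assumption: `pairs` is an iterable of (language_symbol, skill_code)
    tuples; how the pairs are formed is not specified by the source.
    """
    pairs = list(pairs)
    n = len(pairs)
    h_l = entropy(Counter(l for l, _ in pairs))

    # H(L|z) = sum over codes of p(z) * H(L | z = code)
    z_counts = Counter(z for _, z in pairs)
    h_l_given_z = 0.0
    for code, count in z_counts.items():
        conditional = Counter(l for l, z in pairs if z == code)
        h_l_given_z += (count / n) * entropy(conditional)

    return h_l - h_l_given_z


# Hypothetical toy data: each skill code maps to exactly one word,
# so knowing z removes all uncertainty about L.
dependent = [("pick", 0), ("pick", 0), ("push", 1), ("push", 1)]
# Here the code carries no information about the word.
independent = [("pick", 0), ("pick", 1), ("push", 0), ("push", 1)]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

With the toy data, the dependent case gives an estimate equal to H(L) and the independent case gives zero, which is the behavior the identity predicts. Empirical estimates of this kind are biased upward on small samples, so they are better read as relative comparisons than as absolute values.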
Appendix A Broader Societal Impact
We introduce a new method for language-conditioned imitation learning to perform complex navigation and manipulation tasks. Our intention is for this algorithm to be used in a real-world setting where humans can provide natural language instructions to robots that can carry them out. This is the best we can do since we don't know exactly which tokens in the instruction correspond to the skills chosen. We have provided details about the levels we evaluated on below. More details can be found in the original paper.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
Liang, Zhixuan, Mu, Yao, Ma, Hengbo, Tomizuka, Masayoshi, Ding, Mingyu, Luo, Ping
Diffusion models have demonstrated strong potential for robotic trajectory planning. However, generating coherent and long-horizon trajectories from high-level instructions remains challenging, especially for complex tasks requiring multiple sequential skills. We propose SkillDiffuser, an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level, the skill abstraction module learns discrete, human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. It allows for generating diverse state trajectories that adhere to the learnable skills. By integrating skill learning with conditional trajectory generation, SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
LISA: Learning Interpretable Skill Abstractions from Language
Garg, Divyansh, Vaidyanath, Skanda, Kim, Kuno, Song, Jiaming, Ermon, Stefano
Learning policies that effectively utilize language instructions in complex, multi-task environments is an important problem in sequential decision-making. While it is possible to condition on the entire language instruction directly, such an approach could suffer from generalization issues. In our work, we propose Learning Interpretable Skill Abstractions (LISA), a hierarchical imitation learning framework that can learn diverse, interpretable primitive behaviors or skills from language-conditioned demonstrations to better generalize to unseen instructions. LISA uses vector quantization to learn discrete skill codes that are highly correlated with language instructions and the behavior of the learned policy. In navigation and robotic manipulation environments, LISA outperforms a strong non-hierarchical Decision Transformer baseline in the low data regime and is able to compose learned skills to solve tasks containing unseen long-range instructions. Our method demonstrates a more natural way to condition on language in sequential decision-making problems and achieve interpretable and controllable behavior with the learned skills.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.87)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
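The LISA abstract above states that discrete skill codes are learned with vector quantization. As a rough illustration of the lookup step only, the sketch below maps a continuous vector to the index of its nearest codebook entry under squared Euclidean distance. The codebook values, its size, the distance metric, and the name `quantize` are assumptions for illustration; how the codebook is trained is not covered here and is not described in the abstract.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector`.

    Nearest is measured by squared Euclidean distance; ties resolve to
    the lowest index. The index plays the role of a discrete skill code.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# Hypothetical 2-D codebook with three skill codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

index_a, code_a = quantize([0.9, 0.8], codebook)
index_b, code_b = quantize([-0.2, 0.1], codebook)
```

Because the output is an integer index rather than a continuous vector, the same index can be inspected across many instructions, which is one plausible reading of why discrete codes lend themselves to interpretation.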