minedojo
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite, knowledge bases, algorithm implementation, and pretrained models (https://minedojo.org) to promote research towards the goal of generally capable embodied agents.
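The idea of using a video-language model as a learned reward function can be illustrated with a minimal sketch. It assumes the model has already mapped a short clip of agent frames and the task prompt to embedding vectors; the embeddings here are plain Python lists, and `cosine_similarity`, `language_reward`, and the `baseline` parameter are hypothetical names chosen for illustration, not part of the MineDojo release.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def language_reward(video_embedding, prompt_embedding, baseline=0.0):
    """Hypothetical shaping reward: similarity above a baseline, clipped at zero.

    Assumption: both embeddings come from the same pre-trained
    video-language model, so their similarity reflects how well the
    recent frames match the free-form task description.
    """
    similarity = cosine_similarity(video_embedding, prompt_embedding)
    return max(0.0, similarity - baseline)


if __name__ == "__main__":
    clip = [0.2, 0.9, 0.1]     # stand-in for an encoded video snippet
    prompt = [0.1, 0.8, 0.3]   # stand-in for an encoded language prompt
    print(language_reward(clip, prompt, baseline=0.5))
```

In an RL loop, a value like this would be computed at each step from the most recent frames and used in place of a hand-written dense reward. Subtracting a baseline is one possible design choice for suppressing the similarity that unrelated clips already share with the prompt.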
CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
Ding, Ziluo, Luo, Hao, Li, Ke, Yue, Junpeng, Huang, Tiejun, Lu, Zongqing
One of the essential missions in the AI research community is to build an autonomous embodied agent that can attain high-level performance across a wide spectrum of tasks. However, acquiring a reward or penalty signal for every open-ended task is unrealistic, which makes the Reinforcement Learning (RL) training procedure infeasible. In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC, which aims to learn an RL-friendly vision-language model that serves as a reward function for open-ended tasks, so that no further task-specific reward design is needed. Intuitively, it is more reasonable for the model to measure the similarity between the video snippet and the language prompt at both the action and entity levels. To this end, a motion encoder is proposed to capture the motion embeddings across different intervals. The correlation scores are then used to construct the auxiliary reward signal for RL agents. Moreover, we construct a neat YouTube dataset based on the large-scale YouTube database provided by MineDojo. Specifically, two rounds of filtering operations guarantee that the dataset covers enough essential information and that each video-text pair is highly correlated. Empirically, we show that the proposed method achieves better performance on RL tasks compared with baselines.
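As a rough illustration of cross-modal contrastive learning, the sketch below computes a symmetric InfoNCE-style loss over a batch of paired video and text embeddings. This is a common formulation for CLIP-like models and is shown as an assumption; it is not CLIP4MC's actual objective and does not include its motion encoder. The function names and the `temperature` default are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(video_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; the i-th video and i-th text form a positive pair."""
    videos = [_normalize(v) for v in video_embeddings]
    texts = [_normalize(t) for t in text_embeddings]
    n = len(videos)
    # Temperature-scaled cosine similarity between every video and every text.
    sim = [[sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Video-to-text: each row should put its mass on the diagonal entry.
    video_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-video: same idea along the columns.
    text_to_video = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (video_to_text + text_to_video)


if __name__ == "__main__":
    videos = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(videos, aligned))  # small: pairs line up
    print(contrastive_loss(videos, swapped))  # large: pairs are mismatched
```

Training with a loss of this kind pushes matching video-text pairs together and mismatched pairs apart, which is what lets the resulting similarity score double as a correlation-based reward signal.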
Nvidia AI plays Minecraft, wins machine-learning conference award
A paper describing MineDojo, Nvidia's generalist AI agent that can perform actions from written prompts in Minecraft, won an Outstanding Datasets and Benchmarks Paper Award at the 2022 NeurIPS (Neural Information Processing Systems) conference, Nvidia revealed on Monday. To train the MineDojo framework to play Minecraft, researchers fed it 730,000 Minecraft YouTube videos (with more than 2.2 billion words transcribed), 7,000 scraped webpages from the Minecraft wiki, and 340,000 Reddit posts and 6.6 million Reddit comments describing Minecraft gameplay. From this data, the researchers created a custom transformer model called MineCLIP that associates video clips with specific in-game Minecraft activities. As a result, someone can tell a MineDojo agent what to do in the game using high-level natural language, such as "find a desert pyramid" or "build a nether portal and enter it," and MineDojo will execute the series of steps necessary to make it happen in the game. MineDojo aims to create a flexible agent that can generalize learned actions and apply them to different behaviors in the game.