{"id":68,"date":"2023-01-24T12:17:35","date_gmt":"2023-01-24T12:17:35","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/01\/24\/learning-to-play-minecraft-with-video-pretraining-vpt\/"},"modified":"2025-04-27T07:36:22","modified_gmt":"2025-04-27T07:36:22","slug":"learning-to-play-minecraft-with-video-pretraining-vpt","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/01\/24\/learning-to-play-minecraft-with-video-pretraining-vpt\/","title":{"rendered":"Learning to Play Minecraft with Video PreTraining (VPT)"},"content":{"rendered":"<div>\n        <!--kg-card-begin: markdown--><\/p>\n<div class=\"js-excerpt\">\n<p>We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.<\/p>\n<\/div>\n<section class=\"btns mb-2\"><a href=\"https:\/\/arxiv.org\/abs\/2206.11795\" class=\"btn btn-ypadded pl-0.125 d-block icon-paper\">Read Paper<\/a><\/p>\n<hr class=\"my-0\"\/><a href=\"https:\/\/github.com\/openai\/Video-Pre-Training\" class=\"btn btn-ypadded pl-0.125 d-block icon-code\">View Code and model weights<\/a><\/p>\n<hr class=\"my-0\"\/><a href=\"https:\/\/www.aicrowd.com\/challenges\/neurips-2022-minerl-basalt-competition\" class=\"btn btn-ypadded pl-0.125 d-block icon-external\">MineRL Competition<\/a><br \/>\n<\/section>\n<p>The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. 
However, these videos only provide a record of <em>what<\/em> happened but not precisely <em>how<\/em> it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale <a href=\"https:\/\/arxiv.org\/abs\/2108.07258\">foundation models<\/a> in these domains as we\u2019ve done in language with <a href=\"https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html\">GPT<\/a>, this lack of action labels poses a new challenge not present in the language domain, where \u201caction labels\u201d are simply the next words in a sentence.<\/p>\n<p>In order to utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past <em>and future<\/em> information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given <em>past video frames only<\/em>, which requires inferring what the person wants to do and how to accomplish it. 
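The pipeline described above (train a non-causal IDM on a small labeled dataset, then use it to pseudo-label a large unlabeled corpus for behavioral cloning) can be sketched in miniature. Everything in the sketch below is illustrative rather than the actual VPT implementation: `SimpleIDM` is a toy nearest-centroid stand-in for the real neural IDM, and the random arrays are placeholders for video frames.

```python
# Toy sketch of the VPT labeling pipeline. All names, shapes, and the
# nearest-centroid "model" are illustrative stand-ins, not the real system.
import numpy as np

rng = np.random.default_rng(0)

class SimpleIDM:
    """Toy inverse dynamics model: predicts the action at step t from a
    window of frames both BEFORE and AFTER t (non-causal), which is what
    makes this labeling task easier than behavioral cloning."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.centroids = None

    def fit(self, windows, actions):
        # One centroid per action over flattened frame windows.
        d = windows.reshape(len(windows), -1)
        self.centroids = np.stack(
            [d[actions == a].mean(axis=0) for a in range(self.n_actions)]
        )

    def predict(self, windows):
        d = windows.reshape(len(windows), -1)
        # Nearest-centroid classification stands in for a real network.
        dists = ((d[:, None, :] - self.centroids[None]) ** 2).sum(-1)
        return dists.argmin(axis=1)

# 1) Small contractor dataset: frame windows with ground-truth actions.
n_actions = 3
labeled_windows = rng.normal(size=(300, 5, 4))   # 300 windows of 5 frames
labeled_actions = rng.integers(0, n_actions, 300)
# Make the center frame weakly informative about the action taken.
labeled_windows[np.arange(300), 2, 0] += labeled_actions

idm = SimpleIDM(n_actions)
idm.fit(labeled_windows, labeled_actions)

# 2) Much larger unlabeled corpus: pseudo-label it with the trained IDM.
unlabeled_windows = rng.normal(size=(5000, 5, 4))
pseudo_actions = idm.predict(unlabeled_windows)

# 3) The (video, pseudo-action) pairs become the behavioral-cloning dataset.
bc_dataset = list(zip(unlabeled_windows, pseudo_actions))
print(len(bc_dataset))
```

The structural point the sketch preserves is the data flow: a small labeled set trains the IDM, which then labels a far larger corpus; a real behavioral-cloning model trained on that corpus would condition only on the frames preceding each pseudo-action.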
We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.<\/p>\n<p><!-- overview graphic --><\/p>\n<figure id=\"vpt-overview\" class=\"mt-2 mb-3\">\n<div class=\"d-none d-md-block wide my-0\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/overview.svg\" class=\"mx-xl-auto\" style=\"max-width:880px\"\/><\/div>\n<div class=\"d-md-none\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/overview-vertical.svg\" style=\"max-width:100%;width:330px\"\/><\/div><figcaption class=\"mt-0\">VPT method overview<\/figcaption><\/figure>\n<h2 id=\"vpt-zero-shot-results\">VPT Zero-Shot Results<\/h2>\n<p>We chose to validate our method in Minecraft because it (1) is one of the most actively played video games in the world and thus has a wealth of freely available video data and (2) is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. Unlike <a href=\"https:\/\/arxiv.org\/abs\/2106.14876\">prior<\/a> <a href=\"https:\/\/arxiv.org\/abs\/2009.14108\">works<\/a> in Minecraft that use simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: 20Hz framerate with the mouse and keyboard.<\/p>\n<p>Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the \u201cVPT foundation model\u201d) accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. 
It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds or 1,000 consecutive game actions.<\/p>\n<p><!-- crafting table sequence --><\/p>\n<figure id=\"crafting-table-sequence\">\n<div class=\"d-none d-md-block\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/crafting-table-sequence.svg\"\/><\/div>\n<div class=\"d-md-none\" style=\"max-width:100%;width:200px\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/crafting-table-sequence-vertical.svg\"\/><\/div><figcaption class=\"mt-n0.75\">Sequence of items required to craft a crafting table, labeled with the median time it takes proficient humans to reach each step<\/figcaption><\/figure>\n<figure class=\"overflow-hidden my-2\"><iframe loading=\"lazy\" src=\"https:\/\/player.vimeo.com\/video\/719971231?h=cbdf2617a1&amp;autopause=0&amp;autoplay=1&amp;loop=1&amp;muted=1&amp;playsinline=1&amp;transparent=1\" width=\"640\" height=\"326\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen=\"\"><\/iframe><figcaption class=\"mt-0\">Crafting of a crafting table &#8220;zero shot&#8221; (i.e. after pre-training only without additional fine-tuning)<\/figcaption><\/figure>\n<p>Additionally, the model performs other complex skills humans often do in the game, such as swimming, hunting animals for food, and eating that food. It also learned the skill of \u201cpillar jumping\u201d, a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself.<\/p>\n<p><!-- end .wide --><\/p>\n<h2 id=\"fine-tuning-with-behavioral-cloning\">Fine-tuning with Behavioral Cloning<\/h2>\n<p>Foundation models are designed to have a broad behavior profile and be generally capable across a wide variety of tasks. 
To incorporate new knowledge or allow them to specialize on a narrower task distribution, it is common practice to fine-tune these models to smaller, more specific datasets. As a case study into how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in brand new Minecraft worlds and build a house from basic Minecraft materials. We hoped that this would amplify the foundation model\u2019s ability to reliably perform \u201cearly game\u201d skills such as building crafting tables. When fine-tuning to this dataset, not only do we see a massive improvement in reliably performing the early game skills already present in the foundation model, but the fine-tuned model also learns to go even deeper into the technology tree by crafting both wooden and stone tools. Sometimes we even see some rudimentary shelter construction and the agent searching through villages, including raiding chests.<\/p>\n<p><!-- stone pickaxe sequence --><\/p>\n<figure id=\"stone-pickaxe-sequence\" class=\"mb-2\">\n<div class=\"wide d-none d-md-block mb-0\" style=\"max-width:820px\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/stone-pickaxe-sequence.svg\"\/><\/div>\n<div class=\"d-md-none\" style=\"max-width:100%;width:200px\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/stone-pickaxe-sequence-vertical.svg\"\/><\/div><figcaption class=\"mt-n0.75\">Sequence of items required to craft a stone pickaxe, labeled with the median time it takes proficient humans to reach each step<\/figcaption><\/figure>\n<h5 class=\"h5-alt\">\nImproved early game behavior from BC fine-tuning<br \/>\n<\/h5>\n<div class=\"wide\">\n<p>\n  <!-- 1a --><\/p>\n<figure class=\"col-12 col-sm-6 col-md-4 col-xl mb-xl-0\"><iframe loading=\"lazy\" src=\"https:\/\/player.vimeo.com\/video\/720045863?h=060f07e290&amp;autopause=0&amp;autoplay=1&amp;loop=1&amp;muted=1&amp;playsinline=1&amp;transparent=1\" width=\"640\" 
height=\"326\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen=\"\"><\/iframe><figcaption class=\"mt-0\">Crafting a stone pickaxe<\/figcaption><\/figure>\n<p><!-- 1b --><\/p>\n<figure class=\"col-12 col-sm-6 col-md-4 col-xl mb-xl-0\"><iframe loading=\"lazy\" src=\"https:\/\/player.vimeo.com\/video\/720045849?h=00398908ed&amp;autopause=0&amp;autoplay=1&amp;loop=1&amp;muted=1&amp;playsinline=1&amp;transparent=1\" width=\"640\" height=\"326\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen=\"\"><\/iframe><figcaption class=\"mt-0\">Constructing a rudimentary wooden shelter<\/figcaption><\/figure>\n<p><!-- 1c --><\/p>\n<figure class=\"col-12 col-sm-6 col-md-4 col-xl mb-xl-0\"><iframe loading=\"lazy\" src=\"https:\/\/player.vimeo.com\/video\/720045834?h=9cb4118c65&amp;autopause=0&amp;autoplay=1&amp;loop=1&amp;muted=1&amp;playsinline=1&amp;transparent=1\" width=\"640\" height=\"326\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen=\"\"><\/iframe><figcaption class=\"mt-0\">Searching through a village<\/figcaption><\/figure>\n<\/p>\n<p> <!-- end .row -->\n<\/div>\n<p> <!-- end .wide --><\/p>\n<h2 id=\"data-scaling\">Data Scaling<\/h2>\n<p>Perhaps the most important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM (as part of the VPT pipeline) than it is to directly train a BC foundation model from that same small contractor dataset. To validate this hypothesis we train foundation models on increasing amounts of data from 1 to 70,000 hours. Those trained on under 2,000 hours of data are trained on the contractor data with ground-truth labels that were originally collected to train the IDM, and those trained on over 2,000 hours are trained on internet data labeled with our IDM. 
We then take each foundation model and fine-tune it to the house building dataset described in the previous section.<\/p>\n<h5 class=\"h5-alt\">\nEffect of foundation model training data on fine-tuning<br \/>\n<\/h5>\n<p>As foundation model data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.<\/p>\n<h2 id=\"fine-tuning-with-reinforcement-learning\">Fine-Tuning with Reinforcement Learning<\/h2>\n<p>When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with <em>random<\/em> exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL because emulating human behavior is likely much more helpful than taking random actions. We set our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult when using the native human interface.<\/p>\n<p>Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. 
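One standard way to make such a long subtask chain tractable for RL is a shaped reward that pays a one-time bonus the first time each milestone item appears in the agent's inventory. A minimal sketch follows; the milestone list and flat per-item values are illustrative assumptions, not the paper's exact reward schedule:

```python
# Hedged sketch of a per-item shaped reward for a milestone sequence.
# MILESTONES and the flat 1.0 values are illustrative, not the actual schedule.
MILESTONES = ["log", "planks", "crafting_table", "wooden_pickaxe",
              "cobblestone", "stone_pickaxe", "furnace", "iron_ore",
              "iron_ingot", "iron_pickaxe", "diamond", "diamond_pickaxe"]

def shaped_reward(inventory, claimed, rewards=None):
    """Give a one-time reward the first time each milestone item appears
    in the agent's inventory; `claimed` tracks already-rewarded items."""
    if rewards is None:
        rewards = {item: 1.0 for item in MILESTONES}  # flat for simplicity
    r = 0.0
    for item in MILESTONES:
        if inventory.get(item, 0) > 0 and item not in claimed:
            r += rewards[item]
            claimed.add(item)
    return r

claimed = set()
r1 = shaped_reward({"log": 2}, claimed)              # first log: rewarded
r2 = shaped_reward({"log": 5}, claimed)              # no new milestone
r3 = shaped_reward({"log": 5, "planks": 4}, claimed) # planks: rewarded
print(r1, r2, r3)  # 1.0 0.0 1.0
```

Tracking `claimed` ensures the agent cannot farm the same milestone repeatedly; in practice, later and rarer items are often assigned larger values than earlier ones.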
To make this task tractable, we reward agents for each item in the sequence.<\/p>\n<p><!-- diamond pickaxe sequence --><\/p>\n<div id=\"diamond-pickaxe-sequence\">\n<div class=\"d-none d-md-block wide\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/diamond-pickaxe-sequence.svg\"\/><\/div>\n<div class=\"d-md-none my-2\">\n    <img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/vpt\/diamond-pickaxe-sequence-vertical.svg\"\/><\/div>\n<\/div>\n<figure class=\"overflow-hidden my-2\"><iframe loading=\"lazy\" src=\"https:\/\/player.vimeo.com\/video\/722260782?h=fec4fe5af9&amp;autopause=0&amp;autoplay=1&amp;loop=1&amp;muted=1&amp;playsinline=1&amp;transparent=1\" width=\"640\" height=\"326\" frameborder=\"0\" allow=\"autoplay; fullscreen\" allowfullscreen=\"\"><\/iframe><figcaption class=\"mt-0\">RL fine-tuned VPT model crafting a diamond pickaxe<\/figcaption><\/figure>\n<p>We found that an RL policy trained from a random initialization (the standard RL method) barely achieves any reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from a VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes humans over 20 minutes (24,000 actions) on average.<\/p>\n<h5 class=\"h5-alt\">\nReward over episodes<br \/>\n<\/h5>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>VPT paves the path toward allowing agents to <em>learn to act<\/em> by watching the vast numbers of videos on the internet. Compared to generative video modeling or contrastive methods that would only yield <em>representational<\/em> priors, VPT offers the exciting possibility of directly learning large scale <em>behavioral priors<\/em> in more domains than just language. 
While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.<\/p>\n<p>For more information, please see <a href=\"https:\/\/arxiv.org\/abs\/2206.11795\">our paper<\/a>. We are also open sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS\u00a0competition this year. Contestants can use and fine-tune our models to try to solve many difficult tasks in Minecraft. Those interested can check out the <a href=\"https:\/\/www.aicrowd.com\/challenges\/neurips-2022-minerl-basalt-competition\">competition webpage<\/a> and compete for a blue-sky prize of <span>$100,000<\/span> in addition to a regular prize pool of <span>$20,000<\/span>. Grants are available to self-identified underrepresented groups and individuals.<\/p>\n<p><!--kg-card-end: markdown--><\/div>\n<p><a href=\"https:\/\/openai.com\/blog\/vpt\/\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled 
video<\/p>\n","protected":false},"author":2,"featured_media":69,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[19],"tags":[],"class_list":["post-68","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-openai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/68","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=68"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/68\/revisions"}],"predecessor-version":[{"id":3026,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/68\/revisions\/3026"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/69"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=68"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=68"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=68"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}