{"id":418,"date":"2023-03-10T19:41:48","date_gmt":"2023-03-10T19:41:48","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/03\/10\/an-embodied-multimodal-language-model-google-ai-blog\/"},"modified":"2025-04-27T07:33:58","modified_gmt":"2025-04-27T07:33:58","slug":"an-embodied-multimodal-language-model-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/03\/10\/an-embodied-multimodal-language-model-google-ai-blog\/","title":{"rendered":"An embodied multimodal language model \u2013 Google AI Blog"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div id=\"post-body-7958350311586448530\">\n<span class=\"byline-author\">Posted by Danny Driess, Student Researcher, and Pete Florence, Research Scientist, Robotics at Google<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEi9Sc6PZ4Ebbi6Op6Lm3eYItsJJidxdqG7-zKZbRQ2L7_AKajgh7MLYybZZw3XHBvRro_GmGSOKjtyghhsNz8iXxBVODBLtbjesTkPo1lGzhwbLZVLT2k7W5QFdC2_C7no1cxeiDed75QJip1fTc9_FqKOBhGdK81pEyCzvZGfRgYji4Tvqbn2lFI2dqw\/s700\/PalmE-Lg.gif\" style=\"display: none;\"\/><\/p>\n<p>\nRecent years have seen tremendous advances across machine learning domains, from models that can <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">explain jokes<\/a> or <a href=\"https:\/\/ai.googleblog.com\/2022\/09\/pali-scaling-language-image-learning-in.html\">answer visual questions<\/a> in a variety of languages to those that can <a href=\"https:\/\/imagen.research.google\/\">produce images based on text descriptions<\/a>. Such innovations have been possible due to the increase in availability of large scale datasets along with novel advances that enable the training of models on these data. 
While scaling of robotics models has seen <a href=\"https:\/\/ai.googleblog.com\/2022\/12\/rt-1-robotics-transformer-for-real.html\">some<\/a> <a href=\"https:\/\/ai.googleblog.com\/2018\/06\/scalable-deep-reinforcement-learning.html\">success<\/a>, it is outpaced by other domains due to a lack of datasets available on a scale comparable to large text corpora or image datasets.\n<\/p>\n<p>\nToday we introduce <a href=\"https:\/\/palm-e.github.io\">PaLM-E<\/a>, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>, a powerful large language model, and \u201cembodied\u201d it (the \u201c<em>E<\/em>\u201d in PaLM-E), by complementing it with sensor data from the robotic agent. This is the key difference from <a href=\"https:\/\/ai.googleblog.com\/2023\/02\/google-research-2022-beyond-robotics.html\">prior efforts<\/a> to bring large language models to robotics \u2014 rather than relying on only textual input, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. 
The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.\n<\/p>\n<p><\/p>\n<p><video autoplay=\"\" loop=\"\" muted=\"\" playsinline=\"\" width=\"100%\"><source src=\"https:\/\/palm-e.github.io\/videos\/palm-e-teaser.mp4\" type=\"video\/mp4\"\/><\/video><\/p>\n<h2>An <em>embodied<\/em> language model, and also a visual-language generalist<\/h2>\n<p>\nOn the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on <em>multiple<\/em> types of robots and for <em>multiple<\/em> modalities (images, robot states, and <a href=\"https:\/\/osrt-paper.github.io\/\">neural scene representations<\/a>). At the same time, PaLM-E is a generally-capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, like quoting poetry, solving math equations or generating code.\n<\/p>\n<p>\nPaLM-E combines our most recent large language model, <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>, together with one of our most advanced vision models, <a href=\"https:\/\/arxiv.org\/abs\/2302.05442\">ViT-22B<\/a>. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language <a href=\"https:\/\/okvqa.allenai.org\/\">OK-VQA<\/a> benchmark, without task-specific fine-tuning, and while retaining essentially the same general language performance as PaLM-540B.\n<\/p>\n<p><\/p>\n<h2>How does PaLM-E work?<\/h2>\n<p>\nTechnically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.\n<\/p>\n<p>\nLanguage models rely on a mechanism to represent text mathematically in a way that neural networks can process. 
This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back to the input, the language model can iteratively generate a longer and longer text.\n<\/p>\n<p>\nThe <em>inputs<\/em> to PaLM-E are text and other modalities \u2014 images, robot states, scene embeddings, etc. \u2014 in an arbitrary order, which we call &#8220;multimodal sentences&#8221;. For example, an input might look like, &#8220;What happened between &lt;img_1&gt; and &lt;img_2&gt;?&#8221;, where &lt;img_1&gt; and &lt;img_2&gt; are two images. The <em>output<\/em> is text generated auto-regressively by PaLM-E, which could be an answer to a question, or a sequence of decisions in text form.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgV4l0MeJLT7W5ais8ulrmeB0OKn5kDR6IWvKUIyXvtURRI2j6iG_-S7c2k05PPLOI7CTSWc3uXmwCJoMsTTHLBgXlCtyXo6dXUbnvp89CDyyU503uNPRWtNaHKuJOd0xemtohsDj9zWejyc1-Mwa8p7Xa4HblsH-NWRrLO8TGllZw11YOZcziji_Qofg\/s1242\/image6.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"610\" data-original-width=\"1242\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgV4l0MeJLT7W5ais8ulrmeB0OKn5kDR6IWvKUIyXvtURRI2j6iG_-S7c2k05PPLOI7CTSWc3uXmwCJoMsTTHLBgXlCtyXo6dXUbnvp89CDyyU503uNPRWtNaHKuJOd0xemtohsDj9zWejyc1-Mwa8p7Xa4HblsH-NWRrLO8TGllZw11YOZcziji_Qofg\/s16000\/image6.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: 
center;\"><span style=\"text-align: left;\">PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and\/or images) and addresses tasks through multimodal language modeling.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\n  The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles &#8220;words&#8221; (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.<\/p>\n<p>\nWe initialize PaLM-E for training with pre-trained models for both the language (<a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>) and vision components (<a href=\"https:\/\/ai.googleblog.com\/2020\/12\/transformers-for-image-recognition-at.html\">Vision Transformer<\/a>, a.k.a. ViT). All parameters of the model can be updated during training.\n<\/p>\n<p><\/p>\n<h2>Transferring knowledge from large-scale training to robots<\/h2>\n<p>\nPaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. 
A key result is that PaLM-E attains significant <em>positive<\/em> <em>knowledge transfer<\/em> from both the vision and language domains, improving the effectiveness of robot learning.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEid8uulU7Ri7GYs5w6vC0L0aFQ1n1ztI_aF21IiQaROW-OG6RQNtNyPeQUNJkYyKlhcper1coHCwCxgiAS_3U5mT74agXDYxCmjQ6pQ27-98bciuRUHIkGaeFl7Vc3duEwAk_rog0V_DE0VVYje9y7PMHthMrM349o5ZALimxf2VDvDrBw-uHUpY56Xng\/s1666\/image10.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"934\" data-original-width=\"1666\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEid8uulU7Ri7GYs5w6vC0L0aFQ1n1ztI_aF21IiQaROW-OG6RQNtNyPeQUNJkYyKlhcper1coHCwCxgiAS_3U5mT74agXDYxCmjQ6pQ27-98bciuRUHIkGaeFl7Vc3duEwAk_rog0V_DE0VVYje9y7PMHthMrM349o5ZALimxf2VDvDrBw-uHUpY56Xng\/s16000\/image10.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\"><span style=\"text-align: left;\">Positive\u00a0<\/span><em style=\"text-align: left;\">transfer\u00a0<\/em><span style=\"text-align: left;\">of knowledge from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains.<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\n  Results show that PaLM-E can address a large set of robotics, vision and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Further, the visual-language data <em>actually significantly improves<\/em> the performance of the robot tasks. 
This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.<\/p>\n<p><\/p>\n<h2>Results<\/h2>\n<p>\nWe evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a <a href=\"https:\/\/ai.googleblog.com\/2022\/12\/rt-1-robotics-transformer-for-real.html\">low-level<\/a> <a href=\"https:\/\/ai.googleblog.com\/2022\/12\/talking-to-robots-in-real-time.html\">language-to-action<\/a> policy to translate text into low-level robot actions.\n<\/p>\n<p>\nIn the first example below, a person asks a mobile robot to bring a bag of chips to them. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to grab a green block. 
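The pairing just described, in which the language model proposes short text steps and a low-level policy turns each step into motor commands, can be illustrated with a toy closed loop. Everything here is an invented stand-in for illustration: `plan_next_step`, `execute_skill`, and the fixed four-step plan are not the actual Google robotics stack, where both components are learned models.

```python
def plan_next_step(instruction, history, observation):
    """Stand-in for the planner: return the next short text action, or None
    when done. PaLM-E would generate this auto-regressively from images and
    text; here we just walk a fixed plan."""
    plan = ["go to the drawer", "open the drawer", "pick up the chip bag",
            "bring it to the user"]
    done = len(history)
    return plan[done] if done < len(plan) else None

def execute_skill(action, robot_state):
    """Stand-in for a low-level language-conditioned policy that translates
    a text action into robot motions; we only record it."""
    robot_state["log"].append(action)
    return robot_state

def run_task(instruction):
    state = {"log": []}
    history = []
    while True:
        # Re-planning at every step is what lets the model react to
        # disturbances, e.g. a chip bag being put back into the drawer.
        action = plan_next_step(instruction, history, observation=state)
        if action is None:
            return state["log"]
        state = execute_skill(action, state)
        history.append(action)

print(run_task("bring me the chips"))
# ['go to the drawer', 'open the drawer', 'pick up the chip bag', 'bring it to the user']
```

The design point is that the planner is consulted once per step rather than once per task, which is what allows plans to be updated as the world changes.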
Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond the training data of that robot.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjPVL4aqAMCyuMbB3h4e7awiwuI6rvYClO6kUNQDbiBjC2Bbg4qk4xCJBXA4ytJNviWxzb-mxjJxyipMCptu58VAIh-liG9P9ck6RndpEOxJSuZCHPI6lSnMAelWyW49QuDkd7xvn5VN0FsDd6gQVqqH0yhU-rxM3zfZzpTcSpwGopdgr9gf9Od4Nh-bg\/s600\/image4.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"338\" data-original-width=\"600\" height=\"180\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjPVL4aqAMCyuMbB3h4e7awiwuI6rvYClO6kUNQDbiBjC2Bbg4qk4xCJBXA4ytJNviWxzb-mxjJxyipMCptu58VAIh-liG9P9ck6RndpEOxJSuZCHPI6lSnMAelWyW49QuDkd7xvn5VN0FsDd6gQVqqH0yhU-rxM3zfZzpTcSpwGopdgr9gf9Od4Nh-bg\/s320\/image4.gif\" width=\"320\"\/><\/a><\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiK9xoiT4cLkwxTmxBZEV3z9vdECY3s9zHvW_NpzJ6nmZbAvuCudjU7Usf2ONcEQlwuRksWLnzC4WsJqkLfOdXwwjvivrIE2al2YhfgpQf_iEmJyWlW9mnlQ98MOrZlIXe7nIS2D6b-h3UNXXVzsgEogLXkfWVfBM9EwpM10AG2XQuRaaVwDTsH06YILQ\/s600\/image3.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"338\" data-original-width=\"600\" height=\"180\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiK9xoiT4cLkwxTmxBZEV3z9vdECY3s9zHvW_NpzJ6nmZbAvuCudjU7Usf2ONcEQlwuRksWLnzC4WsJqkLfOdXwwjvivrIE2al2YhfgpQf_iEmJyWlW9mnlQ98MOrZlIXe7nIS2D6b-h3UNXXVzsgEogLXkfWVfBM9EwpM10AG2XQuRaaVwDTsH06YILQ\/s320\/image3.gif\" width=\"320\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table 
align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">PaLM-E controls a mobile robot operating in a kitchen environment.\u00a0<strong>Left:<\/strong> The task is to get a chip bag. PaLM-E shows robustness against adversarial disturbances, such as putting the chip bag back into the drawer. <strong>Right:<\/strong> The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nIn the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as \u201csort the blocks by colors into corners,\u201d on a different type of robot. It directly looks at the images and produces a sequence of shorter textually-represented actions \u2014 e.g., \u201cPush the blue cube to the bottom right corner,\u201d \u201cPush the blue triangle there too.\u201d \u2014 long-horizon tasks that were out of scope for autonomous completion, even in <a href=\"https:\/\/ai.googleblog.com\/2022\/12\/talking-to-robots-in-real-time.html\">our own most recent models<\/a>. 
We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjZD0JhK1kEFMh3Pl9Tqqz_q-VCDkHb9sHXs-If8RGn6xSscTI_cCBq_lfHWhvYCCNX0EYEABRfYfOUlKCHTuYkxfJ0CFN0s2SD1jFl0Lz6u4_AiQjPh2mjSkl_30PGew3wGa9GABX7tw2bjEN3Up563pPQj-cWtzO5qDs8hFAcfbUT3XxOgHUL8X1G1A\/s600\/image8.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"338\" data-original-width=\"600\" height=\"180\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjZD0JhK1kEFMh3Pl9Tqqz_q-VCDkHb9sHXs-If8RGn6xSscTI_cCBq_lfHWhvYCCNX0EYEABRfYfOUlKCHTuYkxfJ0CFN0s2SD1jFl0Lz6u4_AiQjPh2mjSkl_30PGew3wGa9GABX7tw2bjEN3Up563pPQj-cWtzO5qDs8hFAcfbUT3XxOgHUL8X1G1A\/s320\/image8.gif\" width=\"320\"\/><\/a><\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjhPHuGsVRUyJr8cKUQeq0iGNaNnhrRpVbSW2X-UENhNCY3Ch42_RPFbeIkN54CH7SKPl3A220aZRRiY-zQhRAtwj_OZ8h7mvWU1pu9mUvWKOe6a-R1bBTDnfYxAQKwKkjz9WLDKhd7EfsG94T18wD8i3yAeXlNu3wq4-sARFl19wGnXFC2s-P_1bXFxA\/s600\/image7.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"338\" data-original-width=\"600\" height=\"180\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjhPHuGsVRUyJr8cKUQeq0iGNaNnhrRpVbSW2X-UENhNCY3Ch42_RPFbeIkN54CH7SKPl3A220aZRRiY-zQhRAtwj_OZ8h7mvWU1pu9mUvWKOe6a-R1bBTDnfYxAQKwKkjz9WLDKhd7EfsG94T18wD8i3yAeXlNu3wq4-sARFl19wGnXFC2s-P_1bXFxA\/s320\/image7.gif\" width=\"320\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table 
align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\n  The third <a href=\"https:\/\/arxiv.org\/pdf\/2006.05398.pdf\">robot environment<\/a> is inspired by the field of <a href=\"http:\/\/ifrr.org\/task-and-motion-planning\">task and motion planning<\/a> (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very high number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to also solve these tasks, but it also leverages visual and language knowledge transfer in order to more effectively do so.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEge3-CJEJxJZlxFlsMlo-LKo6VGrAQG1ysyh8WkPTib6OWxS-A69Smu2LJEQ2nKreQd5GqYElQpnvsZxQaDjptIZ6DCSs9FkPMeawbvKZq9yg6UsRE1WqASKeljT8Ig9Zpoc0-fsx3esmPW1k8X0hy7TCLpA19pytOS-tdRhMyp1Dotx-SbgusSCTrnRg\/s480\/image1.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"480\" data-original-width=\"480\" height=\"320\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEge3-CJEJxJZlxFlsMlo-LKo6VGrAQG1ysyh8WkPTib6OWxS-A69Smu2LJEQ2nKreQd5GqYElQpnvsZxQaDjptIZ6DCSs9FkPMeawbvKZq9yg6UsRE1WqASKeljT8Ig9Zpoc0-fsx3esmPW1k8X0hy7TCLpA19pytOS-tdRhMyp1Dotx-SbgusSCTrnRg\/s320\/image1.gif\" width=\"320\"\/><\/a><\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><a 
href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgps85bzuBkjUIMyhLxMHgXyFRKKE3QeJHIZfSf7UCWZFAdTbqUnCVsQNBUxfyn0x81eh2qK9JiP_1plaVEVU6yQE7o2loRmXwYYgDN0m0a6zvXS2sQ1SdxFdMDvEfImKaViTVl-I5xQ9A9CKIN9QTn2xqEzj5UicBsDc6SMPJpcxWmwxgwAnuRpqOglw\/s480\/image2.gif\" style=\"margin-left: auto; margin-right: auto;\"><img loading=\"lazy\" decoding=\"async\" border=\"0\" data-original-height=\"480\" data-original-width=\"480\" height=\"320\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgps85bzuBkjUIMyhLxMHgXyFRKKE3QeJHIZfSf7UCWZFAdTbqUnCVsQNBUxfyn0x81eh2qK9JiP_1plaVEVU6yQE7o2loRmXwYYgDN0m0a6zvXS2sQ1SdxFdMDvEfImKaViTVl-I5xQ9A9CKIN9QTn2xqEzj5UicBsDc6SMPJpcxWmwxgwAnuRpqOglw\/s320\/image2.gif\" width=\"320\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">PaLM-E produces plans for a task and motion planning environment.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nAs a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including <a href=\"https:\/\/www.deepmind.com\/blog\/tackling-multiple-tasks-with-a-single-visual-language-model\">Flamingo<\/a> and <a href=\"https:\/\/ai.googleblog.com\/2022\/09\/pali-scaling-language-image-learning-in.html\">PaLI<\/a>. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging <a href=\"https:\/\/okvqa.allenai.org\/\">OK-VQA<\/a> dataset, which requires not only visual understanding but also external knowledge of the world. 
Further, this result is reached with a generalist model, without fine-tuning specifically on that task.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhqV6JoelPXqclJ8VF3PdgZ4y4X1qQxch8L731djQoUsMyQPKuSa9GdzEGBMJPTuI-9VLcLKP3DfaE9eMByKApSYzDO6Vzjb-a1nb5WF3W-aa9iFbKy43_PTr_xh8sELde6WDXD50zsTLK3-_3GSEIs3UCmgfAomqDbdSQaPKHUV_m9nOAMuu6fB97zDw\/s1781\/image5.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"1097\" data-original-width=\"1781\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhqV6JoelPXqclJ8VF3PdgZ4y4X1qQxch8L731djQoUsMyQPKuSa9GdzEGBMJPTuI-9VLcLKP3DfaE9eMByKApSYzDO6Vzjb-a1nb5WF3W-aa9iFbKy43_PTr_xh8sELde6WDXD50zsTLK3-_3GSEIs3UCmgfAomqDbdSQaPKHUV_m9nOAMuu6fB97zDw\/s16000\/image5.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">PaLM-E exhibits capabilities like visual chain-of-thought reasoning, in which the model breaks down its answering process into smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images despite being trained on only single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms <a href=\"https:\/\/creativecommons.org\/licenses\/by\/2.0\/\" style=\"text-align: left;\">CC-by-2.0<\/a> and was <a href=\"https:\/\/www.flickr.com\/photos\/27728232@N00\/8666371367\" style=\"text-align: left;\">posted to Flickr<\/a> by kowarski. The image of Kobe Bryant is in the Public Domain. 
The other images were taken by us.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/p>\n<h2>Conclusion<\/h2>\n<p>\nPaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain. There are additional topics investigated in further detail in the <a href=\"https:\/\/palm-e.github.io\/assets\/palm-e.pdf\">paper<\/a>, such as how to leverage <a href=\"https:\/\/osrt-paper.github.io\/\">neural scene representations<\/a> with PaLM-E and also the extent to which PaLM-E, with greater model scale, experiences less <a href=\"https:\/\/en.wikipedia.org\/wiki\/Catastrophic_interference\">catastrophic forgetting<\/a> of its language capabilities.\n<\/p>\n<p>\nPaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler of other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.\n<\/p>\n<p><\/p>\n<h2>Acknowledgements<\/h2>\n<p>\n<em>This work was done in collaboration across several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussaint. Danny is a PhD student advised by Marc Toussaint at TU Berlin. 
We also would like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.<\/em>\n<\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/03\/palm-e-embodied-multimodal-language.html\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posted by Danny Driess, Student Researcher, and Pete Florence, Research Scientist, Robotics at Google Recent years have<\/p>\n","protected":false},"author":2,"featured_media":419,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-418","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/418","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=418"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/418\/revisions"}],"predecessor-version":[{"id":2863,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/418\/revisions\/2863"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/419"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp
\/v2\/media?parent=418"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=418"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=418"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}