{"id":326,"date":"2023-02-01T19:45:05","date_gmt":"2023-02-01T19:45:05","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/02\/01\/advancing-open-source-methods-for-instruction-tuning-google-ai-blog\/"},"modified":"2025-04-27T07:35:22","modified_gmt":"2025-04-27T07:35:22","slug":"advancing-open-source-methods-for-instruction-tuning-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/02\/01\/advancing-open-source-methods-for-instruction-tuning-google-ai-blog\/","title":{"rendered":"Advancing open source methods for instruction tuning \u2013 Google AI Blog"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div id=\"post-body-3315790722141828015\">\n<span class=\"byline-author\">Posted by Shayne Longpre, Student Researcher, and Adam Roberts, Senior Staff Software Engineer, Google Research, Brain Team<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiotQFCQbWQFG-jO_skmh_FqnSsh4q2DQTy2JTwLzV8cH8he3epYtMJaZNZx_nBD67sESNzHbcfgzm1VzmeHoEfgcWRKXAAvOmozfp_DfUJZVFxB0XOEr31rHVbTGOygBBD4b62qULCvIWgIXI8nHC6KPuvEM7GQ9Lb8sW-hoik7EEXgp25-9269_-Ktw\/s1197\/Flan2.png\" style=\"display: none;\"\/><\/p>\n<p>\nLanguage models are now capable of performing many new <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> (NLP) tasks by reading instructions, often that they hadn\u2019t seen before. 
The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as \u201cinstruction tuning\u201d, which was introduced by <a href=\"http:\/\/ai.googleblog.com\/2021\/10\/introducing-flan-more-generalizable.html\">FLAN<\/a> and extended in <a href=\"https:\/\/arxiv.org\/abs\/2110.08207\">T0<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2204.07705\">Super-Natural Instructions<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2110.15943\">MetaICL<\/a>, and <a href=\"https:\/\/openai.com\/blog\/instruction-following\/\">InstructGPT<\/a>. However, much of the data that drives these advances remains unreleased to the broader research community.<\/p>\n<p><a name=\"more\"\/> <\/p>\n<p>\nIn \u201c<em><a href=\"https:\/\/arxiv.org\/abs\/2301.13688\">The Flan Collection: Designing Data and Methods for Effective Instruction Tuning<\/a><\/em>\u201d, we closely examine and <a href=\"https:\/\/github.com\/google-research\/FLAN\/tree\/main\/flan\/v2\">release<\/a> a newer and more extensive publicly available collection of tasks, templates, and methods for instruction tuning to advance the community\u2019s ability to analyze and improve instruction-tuning methods. This collection was <a href=\"https:\/\/arxiv.org\/abs\/2210.11416\">first used<\/a> in Flan-T5 and Flan-PaLM, of which the latter achieved significant improvements over <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>. We show that training a model on this collection yields improved performance over comparable public collections on all tested evaluation benchmarks, e.g., a 3%+ improvement on the 57 tasks in the <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">Massive Multitask Language Understanding<\/a> (MMLU) evaluation suite and an 8% improvement on <a href=\"https:\/\/arxiv.org\/abs\/2210.09261\">BigBench Hard<\/a> (BBH). 
Analysis suggests the improvements stem both from the larger and more diverse set of tasks and from applying a set of simple training and data augmentation techniques that are cheap and easy to implement: mixing zero-shot, few-shot, and chain-of-thought prompts during training, enriching tasks with input inversion, and balancing task mixtures. Together, these methods enable the resulting language models to reason more competently over arbitrary tasks, even those for which they haven\u2019t seen any fine-tuning examples. We hope making these findings and resources publicly available will accelerate research into more powerful and general-purpose language models.<\/p>\n<h2>Public instruction tuning data collections<\/h2>\n<p>\nSince 2020, several instruction tuning task collections have been released in rapid succession, as shown in the timeline below. Recent research has yet to coalesce around a unified set of techniques, with different sets of tasks, model sizes, and input formats all represented. 
This new collection, referred to below as \u201cFlan 2022\u201d, combines prior collections from <a href=\"http:\/\/ai.googleblog.com\/2021\/10\/introducing-flan-more-generalizable.html\">FLAN<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2110.08207\">P3\/T0<\/a>, and <a href=\"https:\/\/arxiv.org\/abs\/2204.07705\">Natural Instructions<\/a> with new dialog, program synthesis, and complex reasoning tasks.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhkXcxHrOqj8CzonSKhoN6hLU9umKLEJ_rz80SmDPI7KfrxvHNnwitn_a1gL0qqrBGoKeR_feF3zZXW6bV16GkhFOmLhfxropAk3A1eDWCh-hmS872NvC2T1ckUg_nJStVEELYv6Nzv3ffPIAqLNPPqy61v8QmUFtqwR89vogivpr_ScSbeOOwLP8Olhw\/s1024\/image2.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"682\" data-original-width=\"1024\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhkXcxHrOqj8CzonSKhoN6hLU9umKLEJ_rz80SmDPI7KfrxvHNnwitn_a1gL0qqrBGoKeR_feF3zZXW6bV16GkhFOmLhfxropAk3A1eDWCh-hmS872NvC2T1ckUg_nJStVEELYv6Nzv3ffPIAqLNPPqy61v8QmUFtqwR89vogivpr_ScSbeOOwLP8Olhw\/s16000\/image2.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">A timeline of public instruction tuning collections, including: <a href=\"https:\/\/arxiv.org\/abs\/2005.00700\">UnifiedQA<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2104.08835\">CrossFit<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2104.08773\">Natural Instructions<\/a>, <a href=\"http:\/\/ai.googleblog.com\/2021\/10\/introducing-flan-more-generalizable.html\">FLAN<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2110.08207\">P3\/T0<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2110.15943\">MetaICL<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2111.10952\">ExT5<\/a>, <a 
href=\"https:\/\/arxiv.org\/abs\/2204.07705\">Super-Natural Instructions<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2211.01786\">mT0<\/a>, <a href=\"https:\/\/arxiv.org\/pdf\/2212.09689\">Unnatural Instructions<\/a>, <a href=\"https:\/\/arxiv.org\/pdf\/2212.10560\">Self-Instruct<\/a>, and <a href=\"https:\/\/github.com\/facebookresearch\/metaseq\/tree\/main\/projects\/OPT-IML\">OPT-IML Bench<\/a>. The table describes the release date, the task collection name, the model name, the base model(s) that were finetuned with this collection, the model size, whether the resulting model is Public (green) or Not Public (red), whether they train with zero-shot prompts (\u201cZS\u201d), few-shot prompts (\u201cFS\u201d), chain-of-thought prompts (\u201cCoT\u201d) together (\u201c+\u201d) or separately (\u201c\/\u201d), the number of tasks from this collection in Flan 2022, the total number of examples, and some notable methods, related to the collections, used in these works. Note that the number of tasks and examples vary under different assumptions and so are approximations. Counts for each are reported using task definitions from the respective works.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nIn addition to scaling to more instructive training tasks, The Flan Collection combines training with different types of input-output specifications, including just instructions (zero-shot prompting), instructions with examples of the task (few-shot prompting), and instructions that ask for an explanation with the answer (<a href=\"http:\/\/ai.googleblog.com\/2022\/05\/language-models-perform-reasoning-via.html\">chain of thought prompting<\/a>). Except for <a href=\"https:\/\/openai.com\/blog\/instruction-following\/\">InstructGPT<\/a>, which leverages a collection of proprietary data, Flan 2022 is the first work to publicly demonstrate the strong benefits of mixing these prompting settings together during training. 
Instead of a trade-off between the various settings, mixing prompting settings during training improves all prompting settings at inference time, as shown below for both tasks held-in and held-out from the set of fine-tuning tasks.<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhEFGOanB-8bGrvJ7baiv1iYRG-6QsyQKJ9r8gmFjS26TULxn8cbW2JNEz8tt8VxXj72EtfGUNcGtt0Au-e1OGkJSFjpA1c2p3YRAvN4lNenibJa2kOlXno3nLyk0H9uw-Y4Ugy_Pbtza6JZkBZZ50tPo7tWlr_vdZFqlrdl2ICbqbv0mnV8PBDk5bmIg\/s1011\/image5.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"452\" data-original-width=\"1011\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhEFGOanB-8bGrvJ7baiv1iYRG-6QsyQKJ9r8gmFjS26TULxn8cbW2JNEz8tt8VxXj72EtfGUNcGtt0Au-e1OGkJSFjpA1c2p3YRAvN4lNenibJa2kOlXno3nLyk0H9uw-Y4Ugy_Pbtza6JZkBZZ50tPo7tWlr_vdZFqlrdl2ICbqbv0mnV8PBDk5bmIg\/s16000\/image5.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Training jointly with zero-shot and few-shot prompt templates improves performance on both held-in and held-out tasks. The stars indicate the peak performance in each setting. Red lines denote the zero-shot prompted evaluation, lilac denotes few-shot prompted evaluation.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Evaluating instruction tuning methods<\/h2>\n<p>\nTo understand the overall effects of swapping one instruction tuning collection for another, we fine-tune equivalently-sized <a href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\">T5<\/a> models on popular public instruction-tuning collections, including Flan 2021, T0++, and Super-Natural Instructions. 
Each model is then evaluated on a set of tasks that are already included in each of the instruction tuning collections, a set of five chain-of-thought tasks, and then a set of 57 diverse tasks from the <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">MMLU<\/a> benchmark, both with zero-shot and few-shot prompts. In each case, the new Flan 2022 model, Flan-T5, outperforms these prior works, demonstrating a more powerful general-purpose NLP reasoner.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEix4heeoEajwR1CLyybk_DOKCkKNUW2DpFn1fjfS61F6cqdAS0xC98HkkHGw5hKmRDnIwd-BTeWua-GJBVruUJ1Dt-8L3Nb9Py_Rh5UxIH0VjDq_-CS54_kIhs2tYvVwDcW8dvaWx9Fqi7htC67n-eN-bkaZAbCSZm1gkPJTujIFwYnsSLadzDnZ-pRBA\/s818\/image1.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"322\" data-original-width=\"818\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEix4heeoEajwR1CLyybk_DOKCkKNUW2DpFn1fjfS61F6cqdAS0xC98HkkHGw5hKmRDnIwd-BTeWua-GJBVruUJ1Dt-8L3Nb9Py_Rh5UxIH0VjDq_-CS54_kIhs2tYvVwDcW8dvaWx9Fqi7htC67n-eN-bkaZAbCSZm1gkPJTujIFwYnsSLadzDnZ-pRBA\/s16000\/image1.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Comparing public instruction tuning collections on held-in, chain-of-thought, and held-out evaluation suites, such as <a href=\"https:\/\/github.com\/suzgunmirac\/BIG-Bench-Hard\">BigBench Hard<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2009.03300\">MMLU<\/a>. All models except OPT-IML-Max (175B) are trained by us, using T5-XL with 3B parameters. 
Green text indicates improvement over the next best comparable T5-XL (3B) model.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Single task fine-tuning<\/h2>\n<p>\nIn applied settings, practitioners usually deploy NLP models fine-tuned specifically for one target task, where training data is already available. We examine this setting to understand how Flan-T5 compares to T5 models as a starting point for applied practitioners. Three settings are compared: fine-tuning T5 directly on the target task, using Flan-T5 without further fine-tuning on the target task, and fine-tuning Flan-T5 on the target task. For both held-in and held-out tasks, fine-tuning Flan-T5 offers an improvement over fine-tuning T5 directly. In some instances, usually where training data is limited for a target task, Flan-T5 without further fine-tuning outperforms T5 <em>with<\/em> direct fine-tuning.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjsGaMDw-uZ7R6Mwi-VFjcApudRojyqwPIVQSEIdjwaYQUjSdx1T9lWlD3n34JNEJ9ektixpsNj3FbH9C_i-R-BMKxZXheW4Sin5OnlXqvkWdep7KtbrX9FtmPbRXwoZZSjhr0xszmE129Fbo_J_EJ6_VgI9wiS-HJ9PkFnM2wpXZTv2Pk6DMZgcJ_fLQ\/s1999\/image3.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"1398\" data-original-width=\"1999\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjsGaMDw-uZ7R6Mwi-VFjcApudRojyqwPIVQSEIdjwaYQUjSdx1T9lWlD3n34JNEJ9ektixpsNj3FbH9C_i-R-BMKxZXheW4Sin5OnlXqvkWdep7KtbrX9FtmPbRXwoZZSjhr0xszmE129Fbo_J_EJ6_VgI9wiS-HJ9PkFnM2wpXZTv2Pk6DMZgcJ_fLQ\/s16000\/image3.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Flan-T5 outperforms T5 on single-task fine-tuning. 
We compare single-task fine-tuned T5 (blue bars), single-task fine-tuned Flan-T5 (red), and Flan-T5 without any further fine-tuning (beige).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nAn additional benefit of using Flan-T5 as a starting point is that training is significantly faster and cheaper, converging more quickly than T5 fine-tuning, and usually peaking at higher accuracies. This suggests less task-specific training data may be necessary to achieve similar or better results on a particular task.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgV8d_JJE9UeHHRkHaLKhupZoeNGtV0LxOlHbw3Tpow9oPoZv2m4oOMRYdm-N6kyrqjMd_KriEs57lnHsYVAilpbuH_KsG1pZvIz7u_kCH1_wxV2hd5q24bPbxCk3XCjjHRbrCx6Pl_jfPIdAXMItXrGqRpt59Ou0KL1-M68J_dKXvPu4sJwFFO2kJ26w\/s1999\/image4.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"550\" data-original-width=\"1999\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgV8d_JJE9UeHHRkHaLKhupZoeNGtV0LxOlHbw3Tpow9oPoZv2m4oOMRYdm-N6kyrqjMd_KriEs57lnHsYVAilpbuH_KsG1pZvIz7u_kCH1_wxV2hd5q24bPbxCk3XCjjHRbrCx6Pl_jfPIdAXMItXrGqRpt59Ou0KL1-M68J_dKXvPu4sJwFFO2kJ26w\/s16000\/image4.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Flan-T5 converges faster than T5 on single-task fine-tuning, for each of five held-out tasks from Flan fine-tuning. Flan-T5\u2019s learning curve is indicated with the solid lines, and T5\u2019s learning curve with the dashed line. 
All tasks are held out during Flan fine-tuning.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nThe NLP community stands to gain significant energy-efficiency benefits from adopting instruction-tuned models like Flan-T5 for single-task fine-tuning, rather than conventional non-instruction-tuned models. While pre-training and instruction fine-tuning are financially and computationally expensive, they are a one-time cost, usually <a href=\"https:\/\/arxiv.org\/abs\/2108.07258\">amortized over millions of subsequent fine-tuning runs<\/a>, which, for the most prominent models, can become more costly in aggregate. Instruction-tuned models offer a promising way to significantly reduce the number of fine-tuning steps needed to achieve the same or better performance.\n<\/p>\n<h2>Conclusion<\/h2>\n<p>\nThe new Flan instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting method outperforms Flan 2021, P3, and Super-Natural Instructions on held-in, chain-of-thought, MMLU, and BBH benchmarks by 3\u201317% across zero-shot and few-shot variants. 
Results suggest this new collection serves as a more performant starting point for researchers and practitioners interested in either generalizing to new instructions or fine-tuning on a single new task.\n<\/p>\n<h2>Acknowledgements<\/h2>\n<p>\n<em>It was a privilege to work with Jason Wei, Barret Zoph, Le Hou, Hyung Won Chung, Tu Vu, Albert Webson, Denny Zhou, and Quoc V Le on this project.<\/em>\n<\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/02\/the-flan-collection-advancing-open.html\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[ad_1] Posted by Shayne Longpre, Student Researcher, and Adam Roberts, Senior Staff Software Engineer, Google Research, Brain Team<\/p>\n","protected":false},"author":2,"featured_media":327,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-326","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/326","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=326"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/326\/revisions"}],"predecessor-version":[{"id":2907,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/326\/revisions\/2907"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/327"}],"wp:attachment":[{"href":"https:\/\/to
daysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=326"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=326"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=326"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}