{"id":544,"date":"2023-05-15T21:53:19","date_gmt":"2023-05-15T21:53:19","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/05\/15\/larger-language-models-do-in-context-learning-differently-google-ai-blog\/"},"modified":"2025-04-27T07:33:33","modified_gmt":"2025-04-27T07:33:33","slug":"larger-language-models-do-in-context-learning-differently-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/05\/15\/larger-language-models-do-in-context-learning-differently-google-ai-blog\/","title":{"rendered":"Larger language models do in-context learning differently \u2013 Google AI Blog"},"content":{"rendered":"<div id=\"post-body-14165165832846745\">\n<p><span class=\"byline-author\">Posted by Jerry Wei, Student Researcher, and Denny Zhou, Principal Scientist, Google Research<\/span><\/p>\n<p>\nThere have recently been tremendous advances in language models, partly because they can achieve strong performance on new tasks via <a href=\"https:\/\/en.wikipedia.org\/wiki\/Few-shot_learning_(natural_language_processing)\">in-context learning<\/a> (ICL), a process whereby models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example. 
In general, models\u2019 success at in-context learning is enabled by:\n<\/p>\n<ul>\n<li>Their use of semantic prior knowledge from pre-training to predict labels while following the format of in-context examples (e.g., seeing examples of movie reviews with \u201cpositive sentiment\u201d and \u201cnegative sentiment\u201d as labels and performing <a href=\"https:\/\/en.wikipedia.org\/wiki\/Sentiment_analysis\">sentiment analysis<\/a> using prior knowledge).\n<\/li>\n<li>Learning the input-label mappings in context from the presented examples (e.g., finding a pattern that positive reviews should be mapped to one label, and negative reviews should be mapped to a different label).\n<\/li>\n<\/ul>\n<p>\nIn \u201c<a href=\"https:\/\/arxiv.org\/abs\/2303.03846\">Larger language models do in-context learning differently<\/a>\u201d, we aim to learn about how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model that\u2019s used. We investigate two settings to study these two factors \u2014 ICL with flipped labels (flipped-label ICL) and ICL with semantically-unrelated labels (SUL-ICL). In flipped-label ICL, labels of in-context examples are flipped so that semantic priors and input-label mappings disagree with each other. In SUL-ICL, labels of in-context examples are replaced with words that are semantically unrelated to the task presented in-context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in-context with semantically-unrelated labels. 
We also found that instruction tuning strengthens the use of prior knowledge more than it increases the capacity to learn input-label mappings.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgJUQZqnsXprXhMMpPVS3eqvH57D4U4G9xvmXH3rQi1KmIQ45f3a5621hbc2T_gW6ELMUv29ZxqKxfZLVlt8rMRUDfLK2hxPoIpRR-H2D7n_ZUXX7uKunqXFb_x2GOgQdbPsl0JeiSagA0VjP4N9hT-RLHDZbVz7lg-prtnONvkShF7uHGrmSAVURsveA\/s625\/image3.png\" imageanchor=\"1\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"466\" data-original-width=\"625\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgJUQZqnsXprXhMMpPVS3eqvH57D4U4G9xvmXH3rQi1KmIQ45f3a5621hbc2T_gW6ELMUv29ZxqKxfZLVlt8rMRUDfLK2hxPoIpRR-H2D7n_ZUXX7uKunqXFb_x2GOgQdbPsl0JeiSagA0VjP4N9hT-RLHDZbVz7lg-prtnONvkShF7uHGrmSAVURsveA\/s16000\/image3.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">An overview of flipped-label ICL and semantically-unrelated label ICL (SUL-ICL), compared with regular ICL, for a sentiment analysis task. Flipped-label ICL uses flipped labels, forcing the model to override semantic priors in order to follow the in-context examples. 
SUL-ICL uses labels that are not semantically related to the task, which means that models must learn input-label mappings in order to perform the task because they can no longer rely on the semantics of natural language labels.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Experiment design<\/h2>\n<p>\nFor a diverse dataset mixture, we experiment on seven <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">natural language processing<\/a> (NLP) tasks that have been widely used: <a href=\"https:\/\/huggingface.co\/datasets\/sst2\">sentiment analysis<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/SetFit\/subj\">subjective\/objective classification<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/trec\">question classification<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/glue\/viewer\/qqp\/validation\">duplicated-question recognition<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/super_glue\/viewer\/rte\/test\">entailment recognition<\/a>, <a href=\"https:\/\/huggingface.co\/datasets\/financial_phrasebank\">financial sentiment analysis<\/a>, and <a href=\"https:\/\/huggingface.co\/datasets\/ethos\">hate speech detection<\/a>. We test five language model families, <a href=\"https:\/\/arxiv.org\/abs\/2204.02311\">PaLM<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2210.11416\">Flan-PaLM<\/a>, <a href=\"https:\/\/papers.nips.cc\/paper\/2020\/hash\/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html\">GPT-3<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2203.02155\">InstructGPT<\/a>, and <a href=\"https:\/\/arxiv.org\/abs\/2107.03374\">Codex<\/a>.\n<\/p>\n<h2>Flipped labels<\/h2>\n<p>\nIn this experiment, labels of in-context examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as \u201cnegative sentiment\u201d), thereby allowing us to study whether models can override their priors. 
In this setting, models that are able to override prior knowledge and learn input-label mappings in-context should experience a decrease in performance (since ground-truth evaluation labels are not flipped).\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjSlRhZycuNyJePf45hjI8TYSDe5Xbn1Uj9R0DobtZLRy8nTScnl-V7f-Zti6qPpprSHLOac5HqavO8JWg1fy6_0VisA40LVyXAv9MzHQm3Xvkr9WyuktlOqbfga3uaVOCVlhoxGTOZ1qWWznWIvf6NcMC1UgmnDUVsgy9qgu6ncGbUV8J22AacWHiKXg\/s1036\/image1.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"491\" data-original-width=\"1036\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjSlRhZycuNyJePf45hjI8TYSDe5Xbn1Uj9R0DobtZLRy8nTScnl-V7f-Zti6qPpprSHLOac5HqavO8JWg1fy6_0VisA40LVyXAv9MzHQm3Xvkr9WyuktlOqbfga3uaVOCVlhoxGTOZ1qWWznWIvf6NcMC1UgmnDUVsgy9qgu6ncGbUV8J22AacWHiKXg\/s16000\/image1.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">The ability to override semantic priors when presented with flipped in-context example labels emerges with model scale. Smaller models cannot flip predictions to follow flipped labels (performance only decreases slightly), while larger models can do so (performance decreases to well below 50%).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nWe found that when no labels are flipped, larger models have better performance than smaller models (as expected). 
But as we flip more and more labels, the performance of small models stays relatively flat, while large models experience large performance drops to well below random guessing (e.g., 90% \u2192 22.5% for code-davinci-002).\n<\/p>\n<p>\nThese results indicate that large models can override prior knowledge from pre-training when contradictory input-label mappings are presented in-context. Small models can\u2019t do this, making this ability an emergent phenomenon of model scale.\n<\/p>\n<h2>Semantically-unrelated labels<\/h2>\n<p>\nIn this experiment, we replace labels with semantically-unrelated ones (e.g., for sentiment analysis, we use \u201cfoo\/bar\u201d instead of \u201cnegative\/positive\u201d), which means that the model can only perform ICL by learning from input-label mappings. If a model mostly relies on prior knowledge for ICL, then its performance should decrease after this change since it will no longer be able to use the semantic meanings of labels to make predictions. A model that can learn input\u2013label mappings in-context, on the other hand, would be able to learn these semantically-unrelated mappings and should not experience a major drop in performance.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEisGHKYzD0d4B8ynJIhqpU6wedtvu37pMJY7Dd02xWELTodKiDCcXWfu0kQ926XJJWheXRbA7XMMAYP1C3PpY7b9X0N-yfLzVcKwI5nSE4rjOdH1UKBLs_e_e4wF8KXQ7ogywNXn5htE-bWeSde7FbJ9JYLSbLVrll3YTfgIQMTKUtdKAMtZ-Zo2WR1hQ\/s1049\/image4.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"430\" data-original-width=\"1049\" 
src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEisGHKYzD0d4B8ynJIhqpU6wedtvu37pMJY7Dd02xWELTodKiDCcXWfu0kQ926XJJWheXRbA7XMMAYP1C3PpY7b9X0N-yfLzVcKwI5nSE4rjOdH1UKBLs_e_e4wF8KXQ7ogywNXn5htE-bWeSde7FbJ9JYLSbLVrll3YTfgIQMTKUtdKAMtZ-Zo2WR1hQ\/s16000\/image4.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Small models rely more on semantic priors than large models do, as indicated by the greater decrease in performance for small models than for large models when using semantically-unrelated labels (i.e., targets) instead of natural language labels. For each plot, models are shown in order of increasing model size (e.g., for GPT-3 models, <em>a<\/em> is smaller than <em>b<\/em>, which is smaller than <em>c<\/em>).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nIndeed, we see that using semantically-unrelated labels results in a greater performance drop for small models. This suggests that smaller models primarily rely on their semantic priors for ICL rather than learning from the presented input-label mappings. 
Large models, on the other hand, have the ability to learn input-label mappings in-context when the semantic nature of labels is removed.\n<\/p>\n<p>\nWe also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than it does for small models, indicating that large models are better at learning from in-context examples than small models are.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEj747MlBNxtirznhOm0RR4P1kEpk97WOclz3N3TMMTBUBgcciB3MuHopHH0UT4I5qOQPUJ1O8n0M0zRr9sePxpecLSnoyb9foa6Ho5qh_xWtkSzeXyTg-VTRL1hTAc_EIXWewymn6vsOCR96SOidHMHxsu-AjsPTFEVwpNT5HZ954zE2AVIuL0qCXmrvw\/s825\/image6.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"417\" data-original-width=\"825\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEj747MlBNxtirznhOm0RR4P1kEpk97WOclz3N3TMMTBUBgcciB3MuHopHH0UT4I5qOQPUJ1O8n0M0zRr9sePxpecLSnoyb9foa6Ho5qh_xWtkSzeXyTg-VTRL1hTAc_EIXWewymn6vsOCR96SOidHMHxsu-AjsPTFEVwpNT5HZ954zE2AVIuL0qCXmrvw\/s16000\/image6.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">In the SUL-ICL setup, larger models benefit more from additional examples than smaller models do.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Instruction tuning<\/h2>\n<p>\n<a href=\"https:\/\/ai.googleblog.com\/2021\/10\/introducing-flan-more-generalizable.html\">Instruction tuning<\/a> is a popular technique for improving model performance, which involves tuning models on various NLP tasks that are phrased as instructions (e.g., \u201cQuestion: What is the sentiment of the following sentence, \u2018This movie is great.\u2019 Answer: Positive\u201d). 
Since the process uses natural language labels, however, an open question is whether it improves the ability to learn input-label mappings or whether it strengthens the ability to recognize and apply semantic prior knowledge. Both of these would lead to an improvement in performance on standard ICL tasks, so it\u2019s unclear which of these occurs.\n<\/p>\n<p>\nWe study this question by running the same two setups as before, only this time we focus on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).\n<\/p>\n<p>\nFirst, we find that Flan-PaLM is better than PaLM when we use semantically-unrelated labels. This effect is especially prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which isn\u2019t particularly surprising.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgeFBd7p-muxUg1CTDEkDx4VzH4IG1wtn2z-qW3P0dwSiUu_GS-BQW0RSG-WveJl89MwovvYMQL5UN6Ldze6laCzCxSHhkn3uxf5kuzLmFgtU9hPvstPq-4YmdWWoxuHMnazQCOs-F9faQd-AMtxg6zZsxTD6ZGCm42iV8JZrbAWKrA526dHyppLOQR_A\/s573\/image5.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"301\" data-original-width=\"573\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgeFBd7p-muxUg1CTDEkDx4VzH4IG1wtn2z-qW3P0dwSiUu_GS-BQW0RSG-WveJl89MwovvYMQL5UN6Ldze6laCzCxSHhkn3uxf5kuzLmFgtU9hPvstPq-4YmdWWoxuHMnazQCOs-F9faQd-AMtxg6zZsxTD6ZGCm42iV8JZrbAWKrA526dHyppLOQR_A\/s16000\/image5.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Instruction-tuned language models are better at learning 
input\u2013label mappings than pre-training\u2013only language models are.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nMore interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction-tuned models were unable to override their prior knowledge (Flan-PaLM models don\u2019t reach below random guessing with 100% flipped labels, but PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results suggest that instruction tuning increases the extent to which models rely on semantic priors when they\u2019re available.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEikUB8OahGEvoe32aO4ZRaTg0rFSsC69cso4eJS1FIV4BOTTLJi93HQ3e4lUnPDJcea98tBsVJnjh8-CgZ10lBtSl1UlNiY6-hHotYq_ow2TEUmcb1tj9NaAFRWxaDTYO1_K0y6bgTg5BvNdihcvHsd78zk3Mn4jwFic5gdEvYn5Ol-JIRmYehgoHtfrg\/s1016\/image2.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"270\" data-original-width=\"1016\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEikUB8OahGEvoe32aO4ZRaTg0rFSsC69cso4eJS1FIV4BOTTLJi93HQ3e4lUnPDJcea98tBsVJnjh8-CgZ10lBtSl1UlNiY6-hHotYq_ow2TEUmcb1tj9NaAFRWxaDTYO1_K0y6bgTg5BvNdihcvHsd78zk3Mn4jwFic5gdEvYn5Ol-JIRmYehgoHtfrg\/s16000\/image2.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Instruction-tuned models are worse than pre-training\u2013only models at learning to override semantic priors when presented with flipped labels in-context.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nCombined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the use of semantic prior knowledge 
more.\n<\/p>\n<h2>Conclusion<\/h2>\n<p>\nWe examined the extent to which language models learn in-context by utilizing prior knowledge learned during pre-training versus input-label mappings presented in-context.\n<\/p>\n<p>\nWe first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully doing ICL using semantically-unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and saw that instruction tuning improves the capacity to learn input-label mappings but also strengthens the use of semantic prior knowledge even more.\n<\/p>\n<h2>Future work<\/h2>\n<p>\nThese results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights on why these phenomena occur with respect to model scale.\n<\/p>\n<h2>Acknowledgements<\/h2>\n<p>\n<em>This work was conducted by Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. 
We would like to thank Sewon Min and our fellow collaborators at Google Research for their advice and helpful discussions.<\/em><\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/05\/larger-language-models-do-in-context.html\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posted by Jerry Wei, Student Researcher, and Denny Zhou, Principal Scientist, Google Research There have recently been<\/p>\n","protected":false},"author":2,"featured_media":545,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-544","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/544","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=544"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/544\/revisions"}],"predecessor-version":[{"id":2800,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/544\/revisions\/2800"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/545"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.c
om\/index.php\/wp-json\/wp\/v2\/tags?post=544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}