<h1>Techniques for training large neural networks</h1>
<div>
<div>
<p>Pipeline parallelism splits a model “vertically” by layer. It is also possible to split certain operations “horizontally” within a layer, an approach usually called <em>tensor parallel</em> training. For many modern models (such as the <a href="https://jalammar.github.io/illustrated-transformer/" rel="noopener noreferrer" target="_blank">Transformer</a>), the computational bottleneck is multiplying an activation batch matrix by a large weight matrix. <a href="https://en.wikipedia.org/wiki/Matrix_multiplication" rel="noopener noreferrer" target="_blank">Matrix multiplication</a> can be thought of as dot products between pairs of rows and columns: independent dot products can be computed on different GPUs, or parts of each dot product can be computed on different GPUs and the results summed.
With either strategy, we can slice the weight matrix into even-sized “shards,” host each shard on a different GPU, and use that shard to compute the relevant part of the overall matrix product before communicating to combine the results.</p>
<p>One example is <a href="https://nv-adlr.github.io/MegatronLM" rel="noopener noreferrer" target="_blank">Megatron-LM</a>, which parallelizes the matrix multiplications within the Transformer’s self-attention and MLP layers. <a href="https://arxiv.org/abs/2104.04473" rel="noopener noreferrer" target="_blank">PTD-P</a> combines tensor, data, and pipeline parallelism; its pipeline schedule assigns multiple non-consecutive layers to each device, reducing bubble overhead at the cost of more network communication.</p>
<p>Sometimes the input to the network can be parallelized across a dimension that offers a high degree of parallel computation relative to cross-communication. <a href="https://arxiv.org/abs/2205.05198" rel="noopener noreferrer" target="_blank">Sequence parallelism</a> is one such idea: an input sequence is split across time into multiple sub-examples, proportionally decreasing peak memory consumption by allowing the computation to proceed with more granularly sized examples.</p>
</div>
</div>
<p><a href="https://openai.com/research/techniques-for-training-large-neural-networks">Source link</a></p>
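The two sharding strategies above can be sketched in a few lines of NumPy, simulating each GPU's shard as an array slice. This is a minimal illustration, not a real distributed implementation: the device count and matrix shapes are arbitrary, and the concatenation/summation steps stand in for the all-gather and all-reduce collectives a real tensor-parallel system would perform.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))   # activation batch (batch x d_in)
W = rng.standard_normal((8, 6))   # weight matrix (d_in x d_out)
reference = X @ W                 # what a single device would compute

n_gpus = 2  # hypothetical device count; shards simulated as slices

# Strategy 1 (column-parallel): each "GPU" holds a slice of W's columns
# and computes complete dot products for its share of output features;
# the per-shard outputs are concatenated (an all-gather in practice).
col_shards = np.split(W, n_gpus, axis=1)
col_parallel = np.concatenate([X @ w for w in col_shards], axis=1)

# Strategy 2 (row-parallel): each "GPU" holds a slice of W's rows and
# the matching slice of X's columns, computing a partial dot product;
# the partial results are summed (an all-reduce in practice).
row_shards_W = np.split(W, n_gpus, axis=0)
row_shards_X = np.split(X, n_gpus, axis=1)
row_parallel = sum(x @ w for x, w in zip(row_shards_X, row_shards_W))

assert np.allclose(col_parallel, reference)
assert np.allclose(row_parallel, reference)
```

Both strategies reproduce the reference product exactly; they differ only in which collective (gather of complete outputs vs. reduction of partial sums) is needed to combine the shards.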
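The sequence-parallelism idea can likewise be sketched with NumPy. The sketch below uses a hypothetical position-wise MLP (shapes and shard count are illustrative): because the layer treats each timestep independently, the sequence can be split into sub-examples whose activations are materialized one shard at a time, shrinking peak activation memory proportionally while leaving the result unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_hidden = 12, 4, 16
x = rng.standard_normal((seq_len, d_model))   # one input sequence
W1 = rng.standard_normal((d_model, d_hidden))
W2 = rng.standard_normal((d_hidden, d_model))

def mlp(chunk):
    # Position-wise MLP: each timestep is processed independently, so
    # the sequence dimension can be split freely across devices.
    return np.maximum(chunk @ W1, 0.0) @ W2

full = mlp(x)  # peak hidden activation ~ seq_len x d_hidden floats

# Sequence parallelism: split the sequence into sub-examples; each shard
# only materializes (seq_len / n_shards) x d_hidden hidden activations.
n_shards = 4
sharded = np.concatenate([mlp(c) for c in np.split(x, n_shards, axis=0)])

assert np.allclose(full, sharded)
```

The two results match because nothing in the layer mixes information across timesteps; layers that do (such as attention) need extra communication at the shard boundaries.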