{"id":887,"date":"2023-11-02T15:43:05","date_gmt":"2023-11-02T15:43:05","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/11\/02\/challenges-in-detoxifying-language-models-2\/"},"modified":"2025-04-27T07:30:58","modified_gmt":"2025-04-27T07:30:58","slug":"challenges-in-detoxifying-language-models-2","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/11\/02\/challenges-in-detoxifying-language-models-2\/","title":{"rendered":"Challenges in Detoxifying Language Models"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div>\n<div class=\"article-cover\">\n<div class=\"article-cover__header\">\n<p class=\"article-cover__eyebrow glue-label\">Research<\/p>\n<dl class=\"article-cover__meta\">\n<dt class=\"glue-visually-hidden\">Published<\/dt>\n<dd class=\"article-cover__date glue-label\">\n              <time datetime=\"2021-09-15\"><br \/>\n                15 September 2021<br \/>\n              <\/time>\n            <\/dd>\n<dt class=\"glue-visually-hidden\">Authors<\/dt>\n<dd class=\"article-cover__authors\">\n<p data-block-key=\"d8x6o\">Johannes Welbl, Mia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson *, Pushmeet Kohli, Ben Coppin, Po-Sen Huang (*\u00a0External authors )<\/p>\n<\/dd>\n<\/dl>\n<section class=\"glue-social glue-social--zippy share share--left article-cover__share\" data-glue-expansion-panel-expand-tooltip=\"Share: Expand to see social channels\" data-glue-expansion-panel-collapse-tooltip=\"Share: Hide social channels\" id=\"share-e221834b-d013-4633-8b05-3f3c09637447\">\n<\/section><\/div>\n<\/p><\/div>\n<div class=\"gdm-rich-text rich-text\">\n<h2 data-block-key=\"qh5xs\">Undesired Behavior from Language Models<\/h2>\n<p data-block-key=\"bwnca\">Language models trained on large text corpora can generate <a href=\"https:\/\/cdn.openai.com\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf\" rel=\"noopener\" target=\"_blank\">fluent text<\/a>, and show promise as <a href=\"https:\/\/arxiv.org\/abs\/2005.14165\" rel=\"noopener\" target=\"_blank\">few\/zero shot learners<\/a> and code generation tools, amongst other capabilities. However, prior research has also identified several issues with LM use that should be addressed, including <a href=\"https:\/\/arxiv.org\/abs\/1911.03064\" rel=\"noopener\" target=\"_blank\">distributional biases<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2004.09456\" rel=\"noopener\" target=\"_blank\">social stereotypes<\/a>, potentially revealing <a href=\"https:\/\/arxiv.org\/abs\/2012.07805\" rel=\"noopener\" target=\"_blank\">training samples<\/a>, and other <a href=\"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3442188.3445922\" rel=\"noopener\" target=\"_blank\">possible LM harms<\/a>. One particular type of LM harm is the generation of <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">toxic language<\/a>, which includes hate speech, insults, profanities and threats.<\/p>\n<p data-block-key=\"ih8ce\">In our paper, we focus on LMs and their <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">propensity<\/a> to generate toxic language. We study the effectiveness of different methods to mitigate LM toxicity, and their side-effects, and we investigate the reliability and limits of classifier-based automatic toxicity evaluation.<\/p>\n<p data-block-key=\"hjye3\">Following the definition of toxicity developed by <a href=\"https:\/\/perspectiveapi.com\/\" rel=\"noopener\" target=\"_blank\">Perspective API<\/a>, we here consider an utterance to be <i>toxic if it is rude, disrespectful, or unreasonable language that is likely to make someone leave a discussion<\/i>. However, we note two important caveats. First, toxicity judgements are subjective\u2014they depend both on the raters evaluating toxicity and their cultural background, as well as the inferred context. While not the focus of this work, it is important for future work to continue to develop this above definition, and clarify how it can be fairly applied in different contexts. Second, we note that toxicity covers only one aspect of possible LM harms, excluding e.g. harms arising from distributional model bias.<\/p>\n<h2 data-block-key=\"xxd8o\">Measuring and Mitigating Toxicity<\/h2>\n<p data-block-key=\"yv71r\">To enable safer language model use, we set out to measure, understand the origins of, and mitigate toxic text generation in LMs. There has been prior work which has considered various approaches towards reducing LM toxicity, either by <a href=\"https:\/\/arxiv.org\/abs\/2004.10964\" rel=\"noopener\" target=\"_blank\">fine-tuning<\/a> <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">pre-trained LMs<\/a>, by <a href=\"https:\/\/openreview.net\/forum?id=H1edEyBKDS\" rel=\"noopener\" target=\"_blank\">steering model generations<\/a>, or through direct <a href=\"https:\/\/arxiv.org\/abs\/2104.06390\" rel=\"noopener\" target=\"_blank\">test-time filtering<\/a>. Further, prior <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">work<\/a> has introduced automatic metrics for measuring LM toxicity, both when prompted with different kinds of prompts, as well as in unconditional generation. These metrics rely on the toxicity scores of the widely used <a href=\"https:\/\/perspectiveapi.com\/\" rel=\"noopener\" target=\"_blank\">Perspective API<\/a> model, which is trained on online comments annotated for toxicity.<\/p>\n<p data-block-key=\"f7k2w\">In our study we first show that a combination of relatively simple baselines leads to a drastic reduction, as measured by previously introduced LM toxicity <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">metrics<\/a>. Concretely, we find that a combination of i) filtering the LM training data annotated as toxic by <a href=\"https:\/\/perspectiveapi.com\/\" rel=\"noopener\" target=\"_blank\">Perspective API<\/a>, ii) filtering generated text for toxicity based on a separate, fine-tuned BERT classifier trained to detect toxicity, and iii) <a href=\"https:\/\/arxiv.org\/abs\/1912.02164\" rel=\"noopener\" target=\"_blank\">steering<\/a> the generation towards being less toxic, is highly effective at reducing LM toxicity, as measured by automatic toxicity metrics. When prompted with toxic (or non-toxic) prompts from the <a href=\"https:\/\/arxiv.org\/abs\/2009.11462\" rel=\"noopener\" target=\"_blank\">RealToxicityPrompts<\/a> dataset, we see a 6-fold (or 17-fold) reduction compared with the previously reported state-of-the-art, in the aggregate <i>Probability of Toxicity<\/i> metric. We reach a value of zero in the unprompted text generation setting, suggesting that we have exhausted this metric. Given how low the toxicity levels are in absolute terms, as measured with automatic metrics, the question arises to what extent this is also reflected in human judgment, and whether improvements on these metrics are still meaningful, especially since they are derived from an imperfect automatic classification system. To gather further insights, we turn towards evaluation by humans.<\/p>\n<h2 data-block-key=\"pl6k9\">Evaluation by Humans<\/h2>\n<p data-block-key=\"wkiz6\">We conduct a human evaluation study where raters annotate LM-generated text for toxicity. The results of this study indicate that there is a direct and largely monotonic relation between average human and classifier-based results, and LM toxicity reduces according to human judgment.<\/p>\n<\/div>\n<figure class=\"single-media single-media--inline\">\n<\/figure>\n<div class=\"gdm-rich-text rich-text\">\n<p data-block-key=\"2o2tq\">We found inter-annotator agreement comparable to other studies measuring toxicity, and that annotating toxicity has aspects that are subjective and ambiguous. For example, we found that ambiguity frequently arose as a result of sarcasm, news-style text about violent behavior, and quoting toxic text (either neutrally or in order to disagree with it).<\/p>\n<p data-block-key=\"y6cxs\">In addition, we find that automatic evaluation of LM toxicity becomes less reliable once detoxification measures have been applied. While initially coupled very well, for samples with a high (automatic) toxicity score, the link between human ratings and Perspective API scores disappears once we apply and increase the strength of LM toxicity reduction interventions.<\/p>\n<\/div>\n<figure class=\"single-media single-media--inline\">\n<\/figure>\n<div class=\"gdm-rich-text rich-text\">\n<p data-block-key=\"treuz\">Further manual inspection also reveals that false positive texts mention some identity terms at disproportionate frequencies. For example, for one detoxified model, we observe that within the high automatic toxicity bucket, 30.2% of texts mention the word \u201cgay\u201d, reflecting previously observed biases in automatic toxicity classifiers (which the community is already <a href=\"https:\/\/research.google\/pubs\/pub46743\/\" rel=\"noopener\" target=\"_blank\">working on<\/a> improving). Together, these findings suggest that when judging LM toxicity, a reliance on automatic metrics alone could lead to potentially misleading interpretations.<\/p>\n<h2 data-block-key=\"6jz0n\">Unintended Consequences of Detoxification<\/h2>\n<p data-block-key=\"h17a7\">We further study possible unintended consequences resulting from the LM toxicity reduction interventions. For detoxified language models, we see a marked increase in the language modeling loss, and this increase correlates with the strength of the detoxification intervention. However, the increase is larger on documents that have higher automatic toxicity scores, compared to documents with lower toxicity scores. At the same time, in our human evaluations we did not find notable differences in terms of grammar, comprehension, and in how well the style of prior conditioning text is preserved.<\/p>\n<p data-block-key=\"5ahlw\">Another consequence of detoxification is that it can disproportionately reduce the ability of the LM to model texts related to certain identity groups <i>(i.e. topic coverage)<\/i>, and also text by people from different identity groups and with different dialects <i>(i.e. dialect coverage)<\/i>. We find that there is a larger increase in the language modeling loss for text in African-American English (AAE) when compared to text in White-Aligned English.<\/p>\n<\/div>\n<figure class=\"single-media single-media--inline\">\n<\/figure>\n<div class=\"gdm-rich-text rich-text\">\n<p data-block-key=\"1h538\">We see similar disparities in LM-loss degradation for text related to female actors when compared to text about male actors. For text about certain ethnic subgroups (such as Hispanic American), the degradation in performance is again relatively higher when compared to other subgroups.<\/p>\n<\/div>\n<figure class=\"single-media single-media--inline\">\n<\/figure>\n<div class=\"gdm-rich-text rich-text\">\n<h2 data-block-key=\"dboh7\">Takeaways<\/h2>\n<p data-block-key=\"si0nb\">Our experiments on measuring and mitigating language model toxicity provide us valuable insights into potential next steps towards reducing toxicity-related language model harms.<\/p>\n<p data-block-key=\"64opn\">From our automated and human evaluation studies, we find that existing mitigation methods are indeed very effective at reducing automatic toxicity metrics, and this improvement is largely matched with reductions in toxicity as judged by humans. However, we might have reached an exhaustion point for the use of automatic metrics in LM toxicity evaluation: after the application of toxicity reduction measures, the majority of remaining samples with high automatic toxicity scores are not actually judged as toxic by human raters, indicating that automatic metrics become less reliable for detoxified LMs. This motivates efforts towards designing more challenging benchmarks for automatic evaluation, and to consider human judgment for future studies on LM toxicity mitigation.<\/p>\n<p data-block-key=\"hx8my\">Further, given the ambiguity in human judgements of toxicity, and noting that judgements can vary across users and applications (e.g. language describing violence, that might otherwise be flagged as toxic, might be appropriate in a news article), future work should continue to develop and adapt the notion of toxicity for different contexts, and refine it for different LM applications. We hope the list of phenomena which we found annotator disagreement for is helpful in this regard.<\/p>\n<p data-block-key=\"4e323\">Finally, we also noticed unintended consequences of LM toxicity mitigation, including a deterioration in LM loss, and an unintended amplification of social biases &#8211; measured in terms of topic and dialect coverage &#8211; potentially leading to decreased LM performance for marginalized groups. Our findings suggest that alongside toxicity, it is key for future work to not rely on just a single metric, but to consider an \u201censemble of metrics\u201d which capture different issues. Future interventions, such as further reducing bias in toxicity classifiers will potentially help prevent trade-offs like the ones we observed, enabling safer language model use.<\/p>\n<\/div>\n<aside class=\"notes\">\n<div class=\"glue-page\">\n<div class=\"gdm-rich-text notes__inner\">\n<h2 data-block-key=\"ngzsn\">Acknowledgements<\/h2>\n<p data-block-key=\"6s5p6\">We would like to thank James Besley, Phil Blunsom, Taylan Cemgil, Sanah Choudhry, Iason Gabriel, Geoffrey Irving, Maribeth Rauh, Sebastian Ruder, and Laura Weidinger for comments and discussion, as well as Lucy Vasserman and Jeffrey Sorensen for providing support with using Perspective API, and for discussing their findings on detecting toxicity.<\/p>\n<\/p><\/div>\n<\/p><\/div>\n<\/aside>\n<aside class=\"related-posts\">\n<\/aside><\/div>\n<p>[ad_2]<br \/>\n<br \/><a href=\"https:\/\/deepmind.google\/discover\/blog\/challenges-in-detoxifying-language-models\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[ad_1] Research Published 15 September 2021 Authors Johannes Welbl, Mia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa<\/p>\n","protected":false},"author":2,"featured_media":765,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[],"class_list":["post-887","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-deepmind-ai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/887","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=887"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/887\/revisions"}],"predecessor-version":[{"id":2610,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/887\/revisions\/2610"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/765"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=887"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=887"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=887"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}