{"id":590,"date":"2023-06-09T20:11:31","date_gmt":"2023-06-09T20:11:31","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/06\/09\/advancing-and-evaluating-text-guided-image-inpainting-google-ai-blog\/"},"modified":"2025-04-27T07:33:22","modified_gmt":"2025-04-27T07:33:22","slug":"advancing-and-evaluating-text-guided-image-inpainting-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/06\/09\/advancing-and-evaluating-text-guided-image-inpainting-google-ai-blog\/","title":{"rendered":"Advancing and evaluating text-guided image inpainting \u2013 Google AI Blog"},"content":{"rendered":"<div id=\"post-body-2015281937814138473\">\n<span class=\"byline-author\">Posted by Su Wang and Ceslee Montgomery, Research Engineers, Google Research<br \/>\n<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjCOhgOtXQGmv1getpW3oHgD-4dCvqg9Q_Srs4qnD-FtydmHFb6XEesvUNGkf1eXRemuok7hb58ikl8tSw4xoNSwDd_YZkrdxVgj-6Fb5AeM6DkRB32URqFjpdzLYZaOtjjHcOqXzDmh7KdshdtaNtU1cVgv69UbvwW-v4-yu6h0-XDBe7vyo6PB23dyg\/s320\/Imagen%20Editor%20&amp;%20EditBench%20hero.jpg\" style=\"display: none;\"\/><\/p>\n<p>\nIn the last few years, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Text-to-image_model\">text-to-image generation<\/a> research has seen an explosion of breakthroughs (notably, <a href=\"https:\/\/imagen.research.google\/\">Imagen<\/a>, <a href=\"https:\/\/parti.research.google\/\">Parti<\/a>, <a href=\"https:\/\/cdn.openai.com\/papers\/dall-e-2.pdf\">DALL-E 2<\/a>, etc.) that have naturally permeated into related topics. In particular, text-guided image editing (TGIE) is a practical task that involves editing generated and photographed visuals rather than completely redoing them. 
Quick, automated, and controllable editing is a convenient solution when recreating visuals would be time-consuming or infeasible (e.g., tweaking objects in vacation photos or perfecting fine-grained details on a cute pup generated from scratch). Further, TGIE represents a substantial opportunity to improve training of foundational models themselves. Multimodal models require diverse data to train properly, and TGIE can enable the generation and recombination of high-quality and scalable synthetic data that, perhaps most importantly, can provide methods to optimize the distribution of training data along any given axis.\n<\/p>\n<p><a name=\"more\"\/><\/p>\n<p>\nIn \u201c<a href=\"https:\/\/arxiv.org\/abs\/2212.06909\">Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting<\/a>\u201d, to be presented at <a href=\"https:\/\/cvpr2023.thecvf.com\/\">CVPR 2023<\/a>, we introduce <a href=\"https:\/\/imagen.research.google\/editor\/\">Imagen Editor<\/a>, a state-of-the-art solution for the task of masked <a href=\"https:\/\/en.wikipedia.org\/wiki\/Inpainting\">inpainting<\/a>, i.e., when a user provides text instructions alongside an overlay or \u201cmask\u201d (usually generated within a drawing-type interface) indicating the area of the image they would like to modify. We also introduce <a href=\"https:\/\/imagen.research.google\/editor\/\">EditBench<\/a>, a method that gauges the quality of image editing models. EditBench goes beyond the <a href=\"https:\/\/arxiv.org\/abs\/2104.08718\">commonly<\/a> <a href=\"https:\/\/openreview.net\/pdf?id=bKBhQhPeKaF\">used<\/a> <a href=\"https:\/\/arxiv.org\/abs\/1910.13321\">coarse-grained<\/a> \u201cdoes this image match this text\u201d methods, and drills down to various types of attributes, objects, and scenes for a more fine-grained understanding of model performance. 
In particular, it puts strong emphasis on the faithfulness of image-text alignment without losing sight of image quality.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/1.bp.blogspot.com\/-2ufunOk9RJY\/ZINoiOxEhUI\/AAAAAAAAMVw\/b-PPxjdzuXQ-wz6sgTZ1Iv2bQaBhAoqNACNcBGAsYHQ\/s1261\/image2.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"306\" data-original-width=\"1261\" src=\"https:\/\/1.bp.blogspot.com\/-2ufunOk9RJY\/ZINoiOxEhUI\/AAAAAAAAMVw\/b-PPxjdzuXQ-wz6sgTZ1Iv2bQaBhAoqNACNcBGAsYHQ\/s16000\/image2.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Given an image, a user-defined mask, and a text prompt, Imagen Editor makes localized edits to the designated areas. The model meaningfully incorporates the user\u2019s intent and performs photorealistic edits.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/p>\n<h2>Imagen Editor<\/h2>\n<p>\nImagen Editor is a <a href=\"https:\/\/proceedings.neurips.cc\/paper\/2021\/hash\/940392f5f32a7ade1cc201767cf83e31-Abstract.html\">diffusion-based model<\/a> fine-tuned on <a href=\"https:\/\/arxiv.org\/abs\/2205.11487\">Imagen<\/a> for editing. It targets improved representations of linguistic inputs, fine-grained control and high-fidelity outputs. Imagen Editor takes three inputs from the user: 1) the image to be edited, 2) a binary mask to specify the edit region, and 3) a text prompt \u2014 all three inputs guide the output samples.\n<\/p>\n<p>\nImagen Editor depends on three core techniques for high-quality text-guided image inpainting. 
First, unlike prior inpainting models (e.g., <a href=\"https:\/\/arxiv.org\/abs\/2111.05826\">Palette<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/1801.07892\">Context Attention<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/1806.03589\">Gated Convolution<\/a>) that apply random box and stroke masks, Imagen Editor employs an object detector masking policy with an <a href=\"https:\/\/arxiv.org\/abs\/1801.04381\">object detector module<\/a> that produces object masks during training. Object masks are based on detected objects rather than random patches and allow for more principled alignment between edit text prompts and masked regions. Empirically, the method helps the model stave off the prevalent issue of the text prompt being ignored when masked regions are small or only partially cover an object (e.g., <a href=\"https:\/\/arxiv.org\/abs\/2204.14217\">CogView2<\/a>).\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgzaJwAUNaa0qyyGmGb9m0eJi2w7iRnFJpJzIXMwPpj-geuz5wnwC5QOBaXYCnSOgCrqxkU8GwVkdjaH6oxBxwafDPfSWVkID53dhTkkrv_kDcXRN0vemMoT8tEMQET6v4wL0l6pRoRCvgjmhD3u6Myi9H4qbNClFOBpU4I0QbgVCALEKZLd8RUvVkWtg\/s648\/image9.png\" style=\"margin-left: auto; margin-right: auto;\"><img fetchpriority=\"high\" decoding=\"async\" border=\"0\" data-original-height=\"332\" data-original-width=\"648\" height=\"328\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgzaJwAUNaa0qyyGmGb9m0eJi2w7iRnFJpJzIXMwPpj-geuz5wnwC5QOBaXYCnSOgCrqxkU8GwVkdjaH6oxBxwafDPfSWVkID53dhTkkrv_kDcXRN0vemMoT8tEMQET6v4wL0l6pRoRCvgjmhD3u6Myi9H4qbNClFOBpU4I0QbgVCALEKZLd8RUvVkWtg\/w640-h328\/image9.png\" width=\"640\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Random masks (<strong>left<\/strong>) frequently capture background or 
intersect object boundaries, defining regions that can be plausibly inpainted just from image context alone. Object masks (<strong>right<\/strong>) are harder to inpaint from image context alone, encouraging models to rely more on text inputs during training.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nNext, during training and inference, Imagen Editor enhances high resolution editing by conditioning on full resolution (1024\u00d71024 in this work), channel-wise concatenation of the input image and the mask (similar to <a href=\"https:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?arnumber=9887996\">SR3<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2111.05826\">Palette<\/a>, and <a href=\"https:\/\/arxiv.org\/abs\/2112.10741\">GLIDE<\/a>). For the base diffusion 64\u00d764 model and the 64\u00d764\u2192256\u00d7256 super-resolution models, we apply a parameterized downsampling convolution (e.g., convolution with a stride), which we empirically find to be critical for high fidelity.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiwttcT7FlCELkjzvBAcSHQofxyhoQg61jfOeKYeWHupQr1gIU7ai9Midirr1zU1kdgnP_2Hb47LsWBx6QQhPfmkF2m6fw-KaH4j7woDh1RgXTS3sPV31m0NnBka_1JkEdhU3b8pTO7MKuVIZvPWwy9X3myP6x17wfO4f3AZWKVy-LhEgCyQF8qMkg4hg\/s646\/image4.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"365\" data-original-width=\"646\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiwttcT7FlCELkjzvBAcSHQofxyhoQg61jfOeKYeWHupQr1gIU7ai9Midirr1zU1kdgnP_2Hb47LsWBx6QQhPfmkF2m6fw-KaH4j7woDh1RgXTS3sPV31m0NnBka_1JkEdhU3b8pTO7MKuVIZvPWwy9X3myP6x17wfO4f3AZWKVy-LhEgCyQF8qMkg4hg\/s16000\/image4.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Imagen is fine-tuned 
for image editing. All of the diffusion models, i.e., the base model and super-resolution (SR) models, are conditioned on high-resolution 1024\u00d71024 image and mask inputs. To this end, new convolutional image encoders are introduced.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nFinally, at inference we apply <a href=\"https:\/\/arxiv.org\/abs\/2207.12598\">classifier-free guidance<\/a> (CFG) to bias samples toward a particular conditioning, in this case, text prompts. CFG interpolates between the text-conditioned and unconditioned model predictions to ensure strong alignment between the generated image and the input text prompt for text-guided image inpainting. We follow <a href=\"https:\/\/arxiv.org\/abs\/2210.02303\">Imagen Video<\/a> and use high guidance weights with guidance oscillation (a guidance schedule that oscillates within a value range of guidance weights). In the base model (the stage-1 64\u00d764 diffusion model), where ensuring strong alignment with text is most critical, we use a guidance weight schedule that oscillates between 1 and 30. We observe that high guidance weights combined with oscillating guidance result in the best trade-off between sample fidelity and text-image alignment.\n<\/p>\n<h2>EditBench<\/h2>\n<p>\nThe EditBench dataset for text-guided image inpainting evaluation contains 240 images, with 120 generated and 120 natural images. Generated images are synthesized by <a href=\"https:\/\/parti.research.google\/\">Parti<\/a> and natural images are drawn from the <a href=\"https:\/\/arxiv.org\/abs\/1602.07332\">Visual Genome<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/1811.00982\">Open Images<\/a> datasets. EditBench captures a wide variety of language, image types, and levels of text prompt specificity (i.e., simple, rich, and full captions). Each example consists of (1) a masked input image, (2) an input text prompt, and (3) a high-quality output image used as reference for automatic metrics. 
To provide insight into the relative strengths and weaknesses of different models, EditBench prompts are designed to test fine-grained details along three categories: (1) attributes (e.g., material, color, shape, size, count); (2) object types (e.g., common, rare, text rendering); and (3) scenes (e.g., indoor, outdoor, realistic, or paintings). To understand how different specifications of prompts affect model performance, we provide three text prompt types: a single-attribute (Mask Simple) or a multi-attribute description of the masked object (Mask Rich) \u2013 or an entire image description (Full Image). Mask Rich, especially, probes the models\u2019 ability to handle complex attribute binding and inclusion.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiLRZwI03dQY3RtegS-3jMmg7xUSpcRXRkkqPdWkuuHFQeHi-EI_Uk6tGVlxdJlpquIdZkRXxIRGJAV1uau9ISKFksXjJNZwm_LrLXzEqwsi4Kkb_Q4eHRt2RZTkEsf0Tlng9MDzv0VkRq4-CzOpDAELm_xnDhzcGGbmZJYhwNYj7MNxsc0V-57SKCI7g\/s531\/image1.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"321\" data-original-width=\"531\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiLRZwI03dQY3RtegS-3jMmg7xUSpcRXRkkqPdWkuuHFQeHi-EI_Uk6tGVlxdJlpquIdZkRXxIRGJAV1uau9ISKFksXjJNZwm_LrLXzEqwsi4Kkb_Q4eHRt2RZTkEsf0Tlng9MDzv0VkRq4-CzOpDAELm_xnDhzcGGbmZJYhwNYj7MNxsc0V-57SKCI7g\/s16000\/image1.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">The full image is used as a reference for successful inpainting. The mask covers the target object with a free-form, non-hinting shape. 
We evaluate Mask Simple, Mask Rich and Full Image prompts, consistent with conventional text-to-image models.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nDue to the intrinsic weaknesses in existing automatic evaluation metrics (<a href=\"https:\/\/arxiv.org\/abs\/2104.08718\">CLIPScore<\/a> and <a href=\"https:\/\/datasets-benchmarks-proceedings.neurips.cc\/paper\/2021\/file\/0a09c8844ba8f0936c20bd791130d6b6-Paper-round1.pdf\">CLIP-R-Precision<\/a>) for TGIE, we hold human evaluation as the gold standard for EditBench. In the section below, we demonstrate how EditBench is applied to model evaluation.\n<\/p>\n<h2>Evaluation <\/h2>\n<p>\nWe evaluate the Imagen Editor model \u2014 with object masking (IM) and with random masking (IM-RM) \u2014 against comparable models, <a href=\"https:\/\/arxiv.org\/abs\/2112.10752\">Stable Diffusion<\/a> (SD) and <a href=\"https:\/\/cdn.openai.com\/papers\/dall-e-2.pdf\">DALL-E 2<\/a> (DL2). Imagen Editor outperforms these models by substantial margins across all EditBench evaluation categories.\n<\/p>\n<p>\nFor Full Image prompts, <em>single-image human evaluation<\/em> provides binary answers to confirm if the image matches the caption. For Mask Simple prompts, single-image human evaluation confirms if the object and attribute are properly rendered, and bound correctly (e.g., for a red cat, a white cat on a red table would be an incorrect binding). 
<em>Side-by-side human evaluation<\/em> uses Mask Rich prompts only for side-by-side comparisons between IM and each of the other three models (IM-RM, DL2, and SD), and indicates which image better matches the caption for text-image alignment, and which image is more realistic.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgVL1S0kmbq-xo-aP2FJBTNaBV1z9v8rUgXoFjru__B9DqVhYKjgOwTiXFuUtFo3idjA7ZVJnNteY1VYKL-leNOiKNayiQkwzXlRjnTpJbtM-gHUeRiCOFB1VLKau0NXmCRQktQreonSujMjEtwZZbJlVSLvaBNSk_X4La8wa2i_quAsWBWCElUqEQFPA\/s800\/image7.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"703\" data-original-width=\"800\" height=\"562\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgVL1S0kmbq-xo-aP2FJBTNaBV1z9v8rUgXoFjru__B9DqVhYKjgOwTiXFuUtFo3idjA7ZVJnNteY1VYKL-leNOiKNayiQkwzXlRjnTpJbtM-gHUeRiCOFB1VLKau0NXmCRQktQreonSujMjEtwZZbJlVSLvaBNSk_X4La8wa2i_quAsWBWCElUqEQFPA\/w640-h562\/image7.png\" width=\"640\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Human evaluation. Full Image prompts elicit annotators\u2019 overall impression of text-image alignment; Mask Simple and Mask Rich check for the correct inclusion of particular attributes, objects and attribute binding.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nFor single-image human evaluation, IM receives the highest ratings across the board (10\u201313% higher than the 2nd-highest performing model). For the rest, the performance order is IM-RM &gt; DL2 &gt; SD (with 3\u20136% difference) except with Mask Simple, where IM-RM falls 4\u20138% behind. 
As relatively more semantic content is involved in Full and Mask Rich, we conjecture IM-RM and IM are benefited by the higher performing <a href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\">T5 XXL<\/a> text encoder.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEimKlJIq6ksgqXxJZ2aj3_6VYpD9G9YGzYiuTaeyY1wXvL0pUSz0-5WQRLRWXrLLau1nUQr-yPvSFxHJyEX698wRZGUWN3TyxR_CzR9CEQtIBJyv-2wXWqJLFJOILp6hthc7bEfXbwcMEJD1Ngu1G1eKCkvWcpfdmdWbIpAiGMDFtoCj4wU5652UzuLFA\/s1117\/image5.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"602\" data-original-width=\"1117\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEimKlJIq6ksgqXxJZ2aj3_6VYpD9G9YGzYiuTaeyY1wXvL0pUSz0-5WQRLRWXrLLau1nUQr-yPvSFxHJyEX698wRZGUWN3TyxR_CzR9CEQtIBJyv-2wXWqJLFJOILp6hthc7bEfXbwcMEJD1Ngu1G1eKCkvWcpfdmdWbIpAiGMDFtoCj4wU5652UzuLFA\/s16000\/image5.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Single-image human evaluations of text-guided image inpainting on EditBench by prompt type. For Mask Simple and Mask Rich prompts, text-image alignment is correct if the edited image accurately includes every attribute and object specified in the prompt, including the correct attribute binding. Note that due to different evaluation designs, Full vs. Mask-only prompts, results are less directly comparable.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nEditBench focuses on fine-grained annotation, so we evaluate models for object and attribute types. 
For object types, IM leads in all categories, performing 10\u201311% better than the 2nd-highest performing model in common, rare, and text-rendering.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEi_kohtwKjIGlQU_3QpbhGoXStRXgMYeuJ210wPzLhfBca3KcNJyCPE4v9ziO3WZEH3LwuDoKQEb3MBh3Qh-ea7hd1axd91Eckn-w_LbwaxHEkE0SbiJjC6dh2TqgJVliwc1kIK7Z1NbUhVO6kimeCsU4d6LCXLBHRUrRozAL_tve94Lvk-j_8DNcBoHQ\/s1086\/image8.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"579\" data-original-width=\"1086\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEi_kohtwKjIGlQU_3QpbhGoXStRXgMYeuJ210wPzLhfBca3KcNJyCPE4v9ziO3WZEH3LwuDoKQEb3MBh3Qh-ea7hd1axd91Eckn-w_LbwaxHEkE0SbiJjC6dh2TqgJVliwc1kIK7Z1NbUhVO6kimeCsU4d6LCXLBHRUrRozAL_tve94Lvk-j_8DNcBoHQ\/s16000\/image8.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Single-image human evaluations on EditBench Mask Simple by object type. 
As a cohort, models are better at object rendering than text rendering.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nFor attribute types, IM is rated much higher (13\u201316%) than the 2nd-highest performing model, except for count, where DL2 is merely 1% behind.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiXxQK6jb6aegOKgueFIBnQn4J9J6boT1rSGk7QquOq2BqauFb0idxSzPPpr65gntOfLlA7hjfaWKaRxO-185GsZ0KPlB6gzA3lLffEwCIsC6QAb40VJ6evsGWmxPj1RY5q699eqeN453kM4P_gItkn7u5NSeHw4BPBN1tW_oW5jfEZv7Z5XHCoaS4dTQ\/s1086\/image10.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"607\" data-original-width=\"1086\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiXxQK6jb6aegOKgueFIBnQn4J9J6boT1rSGk7QquOq2BqauFb0idxSzPPpr65gntOfLlA7hjfaWKaRxO-185GsZ0KPlB6gzA3lLffEwCIsC6QAb40VJ6evsGWmxPj1RY5q699eqeN453kM4P_gItkn7u5NSeHw4BPBN1tW_oW5jfEZv7Z5XHCoaS4dTQ\/s16000\/image10.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Single-image human evaluations on EditBench Mask Simple by attribute type. Object masking improves adherence to prompt attributes across the board (IM vs. 
IM-RM).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nIn one-on-one side-by-side comparisons with the other models, IM leads in text alignment by a substantial margin: annotators prefer it over SD, DL2, and IM-RM.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/1.bp.blogspot.com\/-0U0o8UJa8jM\/ZINpLHAJa6I\/AAAAAAAAMWg\/1ieetbzjYAsdLSvNspPcH6RtpMgWdqgwQCNcBGAsYHQ\/s636\/image6.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"319\" data-original-width=\"636\" src=\"https:\/\/1.bp.blogspot.com\/-0U0o8UJa8jM\/ZINpLHAJa6I\/AAAAAAAAMWg\/1ieetbzjYAsdLSvNspPcH6RtpMgWdqgwQCNcBGAsYHQ\/s16000\/image6.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Side-by-side human evaluation of image realism &amp; text-image alignment on EditBench Mask Rich prompts. For text-image alignment, Imagen Editor is preferred in all comparisons.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nFinally, we show a representative side-by-side comparison of all the models. 
See the <a href=\"https:\/\/arxiv.org\/pdf\/2212.06909.pdf\">paper<\/a> for more examples.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjxsWO-5sX1zTSDpxq5VImIxVVpmTkY4cNOixPBiwQI8RB8jcOAo7noT3Hq1MCW5aiGGF2W3GeGxB94kSu6aolr7haZS4Oggyv76YC7VwqBkg7QnLqQQ2KiSYOIXoVFJT-y2RXqLQDbbgxolKEBtDVGDDe83RLOST7foOvy3nFpBxWHQIGpqExS1ZoMvw\/s766\/image3.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"766\" data-original-width=\"638\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjxsWO-5sX1zTSDpxq5VImIxVVpmTkY4cNOixPBiwQI8RB8jcOAo7noT3Hq1MCW5aiGGF2W3GeGxB94kSu6aolr7haZS4Oggyv76YC7VwqBkg7QnLqQQ2KiSYOIXoVFJT-y2RXqLQDbbgxolKEBtDVGDDe83RLOST7foOvy3nFpBxWHQIGpqExS1ZoMvw\/s16000\/image3.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Example model outputs for Mask Simple vs. Mask Rich prompts. Object masking improves Imagen Editor\u2019s fine-grained adherence to the prompt compared to the same model trained with random masking.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/p>\n<h2>Conclusion<\/h2>\n<p>\nWe presented Imagen Editor and EditBench, making significant advancements in text-guided image inpainting and the evaluation thereof. Imagen Editor is a text-guided image inpainting model fine-tuned from Imagen. EditBench is a comprehensive, systematic benchmark for text-guided image inpainting, evaluating performance across multiple dimensions: attributes, objects, and scenes. Note that due to concerns in relation to responsible AI, we are not releasing Imagen Editor to the public. 
EditBench, on the other hand, is <a href=\"https:\/\/imagen.research.google\/editor\/\">released in full<\/a> for the benefit of the research community.\n<\/p>\n<h2>Acknowledgments <\/h2>\n<p>\n<em>Thanks to Gunjan Baid, Nicole Brichtova, Sara Mahdavi, Kathy Meier-Hellstern, Zarana Parekh, Anusha Ramesh, Tris Warkentin, Austin Waters, and Vijay Vasudevan for their generous support. We give thanks to Igor Karpov, Isabel Kraus-Liang, Raghava Ram Pamidigantam, Mahesh Maddinala, and all the anonymous human annotators for their coordination to complete the human evaluation tasks. We are grateful to Huiwen Chang, Austin Tarango, and Douglas Eck for providing paper feedback. Thanks to Erica Moreira and Victor Gomes for help with resource coordination. Finally, thanks to the authors of DALL-E 2 for giving us permission to use their model outputs for research purposes.<\/em>\n<\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/06\/imagen-editor-and-editbench-advancing.html\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posted by Su Wang and Ceslee Montgomery, Research Engineers, Google Research In the last few years, 
text-to-image<\/p>\n","protected":false},"author":2,"featured_media":591,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-590","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"featured_image_urls":{"full":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",646,630,false],"thumbnail":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero-150x150.jpg",150,150,true],"medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero-300x293.jpg",300,293,true],"medium_large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",640,624,false],"large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",640,624,false],"1536x1536":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",646,630,false],"2048x2048":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",646,630,false],"broadnews-featured":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero.jpg",646,630,false],"broadnews-large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero-646x575.jpg",646,575,true],"broadnews-medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/06\/Imagen-Editor-EditBench-hero-590x410.jpg",590,410,true]},"author_info":{"info":["Sanna"]},"category_info":"<a href=\"https:\/\/todaysainews.com\/index.php\/category\/google-ai\/\" rel=\"category tag\">Google AI<\/a>","tag_info":"Google 
AI","comment_count":"0","_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/590","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=590"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/590\/revisions"}],"predecessor-version":[{"id":2777,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/590\/revisions\/2777"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/591"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=590"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=590"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=590"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}