{"id":364,"date":"2023-02-24T21:45:48","date_gmt":"2023-02-24T21:45:48","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/02\/24\/a-vision-language-approach-for-foundational-ui-understanding-google-ai-blog\/"},"modified":"2025-04-27T07:34:18","modified_gmt":"2025-04-27T07:34:18","slug":"a-vision-language-approach-for-foundational-ui-understanding-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/02\/24\/a-vision-language-approach-for-foundational-ui-understanding-google-ai-blog\/","title":{"rendered":"A vision-language approach for foundational UI understanding \u2013 Google AI Blog"},"content":{"rendered":"<div id=\"post-body-2024795523447466512\">\n<span class=\"byline-author\">Posted by Yang Li, Research Scientist, and Gang Li, Software Engineer, Google Research<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgW5CB9_RuAX16vtEzEBDTIGPJmNVYYYwG8h5Um3swCiMXMt2j_Y7QYQG2JXHplG_lkYDxtsBH-1BJfw2NoEcwa_y_n2BQQ8eUvwB83ILBi6PykILGwCIF2gp9SPpG7h_ScS0_VlaV8sXXbbfs98NHjNZPdu_UkhwifGUxmvcmK652dB4D4FF4yvJmMQg\/s320\/Spotlight%20hero.jpeg\" style=\"display: none;\"\/><\/p>\n<p>\nThe computational understanding of user interfaces (UI) is a key step towards achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including <a href=\"https:\/\/aclanthology.org\/2020.emnlp-main.443\/\">widget captioning<\/a>, <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3472749.3474765\">screen summarization<\/a>, and <a href=\"https:\/\/ai.googleblog.com\/2020\/07\/grounding-natural-language-instructions.html\">command grounding<\/a>, that address diverse interaction scenarios such as automation and accessibility. 
We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing <a href=\"https:\/\/ai.googleblog.com\/2019\/04\/using-deep-learning-to-improve.html\">tappability confusion<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2204.02448\">providing insights<\/a> for improving UI design. This work, along with that of others in the field, has showcased how deep neural networks can potentially transform end user experiences and the interaction design practice.\n<\/p>\n<p><a name=\"more\"\/><\/p>\n<p>\nWith these successes in addressing individual UI tasks, a natural question is whether we can obtain a foundational understanding of UIs that can benefit specific UI tasks. As our first attempt to answer this question, we developed <a href=\"https:\/\/arxiv.org\/abs\/2112.05692\">a multi-task model<\/a> to address a range of UI tasks simultaneously. Although the work made some progress, a few challenges remain. Previous UI models rely heavily on <a href=\"https:\/\/developer.android.com\/topic\/performance\/rendering\/optimizing-view-hierarchies\">UI view hierarchies<\/a> (i.e., the structure or metadata of a mobile UI screen, like the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Document_Object_Model\">Document Object Model<\/a> for a webpage) that allow a model to directly acquire detailed information about UI objects on the screen (e.g., their types, text content and positions). This metadata has given previous models advantages over their vision-only counterparts. However, view hierarchies are not always accessible, and are often <a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3491102.3502042\">corrupted<\/a> with missing object descriptions or misaligned structure information. As a result, despite the short-term gains from using view hierarchies, relying on them may ultimately hamper model performance and applicability. 
In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.\n<\/p>\n<p>\nIn \u201c<a href=\"https:\/\/arxiv.org\/abs\/2209.14927\">Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus<\/a>\u201d, accepted for publication at <a href=\"https:\/\/iclr.cc\/\">ICLR 2023<\/a>, we present a vision-only approach that aims to achieve general UI understanding completely from raw pixels. We introduce a unified approach to represent diverse UI tasks, the information for which can be universally represented by two core modalities: vision and language. The vision modality captures what a person would see from a UI screen, and the language modality can be natural language or any token sequences related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding and tappability prediction.\n<\/p>\n<p><\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><img fetchpriority=\"high\" decoding=\"async\" height=\"281\" src=\"https:\/\/lh6.googleusercontent.com\/cxW_LigdqmtmaH0FLPvflcBoSZeTCLYF4Ov6iQLqyiP-BzhXoVR0duivJSEvB5YpzSAkZlFrJUJXQ-nGdAX91E6U7D54ToSzUdBq149uUmxBwBX7ajMAIHCWAYHOWs8BAh8vSxY09wNNhe9yJVoeO9-z_aBKptQO10CNLlprWE0wKazPNHxZpY4M4Afz0aqXtK-8wOjRbqtQO-uXzqmRuSynIUw1rh4vkeVf_A\" style=\"margin-left: auto; margin-right: auto; margin-top: 0px;\" width=\"624\"\/><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\"\/><\/tr>\n<\/tbody>\n<\/table>\n<h2>Spotlight Model<\/h2>\n<p>\nThe Spotlight model input includes a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. 
The output is a text description or response about the region of interest. This simple input and output representation is expressive enough to capture various UI tasks and allows scalable model architectures. This model design allows a spectrum of learning strategies and setups, from task-specific fine-tuning to multi-task learning and few-shot learning. The Spotlight model, as illustrated in the above figure, leverages existing architecture building blocks such as <a href=\"https:\/\/ai.googleblog.com\/2020\/12\/transformers-for-image-recognition-at.html\">ViT<\/a> and <a href=\"https:\/\/ai.googleblog.com\/2020\/02\/exploring-transfer-learning-with-t5.html\">T5<\/a> that are pre-trained in the high-resource, general vision-language domain, which allows us to build on top of the success of these general domain models.\n<\/p>\n<p>\nBecause UI tasks are often concerned with a specific object or area on the screen, which requires a model to be able to focus on the object or area of interest, we introduce a Focus Region Extractor to a vision-language model that enables the model to concentrate on the region in light of the screen context.\n<\/p>\n<p>\nIn particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings by using <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">attention queries<\/a> generated from the bounding box of the region (see <a href=\"https:\/\/arxiv.org\/pdf\/2209.14927.pdf\">paper<\/a> for more details). Specifically, each coordinate (a scalar value, i.e., the left, top, right or bottom) of the bounding box, denoted as a yellow box on the screenshot, is first embedded via a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multilayer_perceptron\">multilayer perceptron<\/a> (MLP) as a collection of dense vectors, and then fed to a Transformer model along with its coordinate-type embedding. 
The dense vectors and their corresponding coordinate-type embeddings are color coded to indicate their affiliation with each coordinate value. Coordinate queries then attend to screen encodings output by ViT via <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">cross attention<\/a>, and the final attention output of the Transformer is used as the region representation for the downstream decoding by T5.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhH5rOgUt9d1JtiJCL19vVM-_MD_Fp81ApQDfLOBL1V9EAmASJludgnb7rQHMf4J4lRBvLlTMsi22laOr3x4C2wFZR8FJ-djLd-vAFKLNwmv48_I-8BKRCV8DXHOcPvCiyCry54OZPttRzk-m9cjf5X2_j937ciECq8aJNqBDfhHz8Sdn02iAnhdDucBw\/s1200\/image3.gif\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"560\" data-original-width=\"1200\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhH5rOgUt9d1JtiJCL19vVM-_MD_Fp81ApQDfLOBL1V9EAmASJludgnb7rQHMf4J4lRBvLlTMsi22laOr3x4C2wFZR8FJ-djLd-vAFKLNwmv48_I-8BKRCV8DXHOcPvCiyCry54OZPttRzk-m9cjf5X2_j937ciECq8aJNqBDfhHz8Sdn02iAnhdDucBw\/s16000\/image3.gif\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">A target region on the screen is summarized by using its bounding box to query into screen encodings from ViT via attentional mechanisms.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Results<\/h2>\n<p>\nWe pre-train the Spotlight model using two unlabeled datasets (an internal dataset based on <a href=\"https:\/\/www.tensorflow.org\/datasets\/catalog\/c4\">C4 corpus<\/a> and an internal mobile dataset) with 2.5 million mobile UI screens and 80 million web pages. 
We then separately fine-tune the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For widget captioning and screen summarization tasks, we report <a href=\"https:\/\/arxiv.org\/abs\/1411.5726\">CIDEr<\/a> scores, which measure how similar a model-generated text description is to a set of references created by human raters. For command grounding, we report accuracy, which measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report <a href=\"https:\/\/en.wikipedia.org\/wiki\/F-score\">F1 scores<\/a>, which measure the model\u2019s ability to tell tappable objects from untappable ones.\n<\/p>\n<p>\nIn this experiment, we compare Spotlight with several baseline models. <a href=\"https:\/\/arxiv.org\/abs\/2010.04295\">Widget Caption<\/a> uses view hierarchy and the image of each UI object to generate a text description for the object. Similarly, <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3472749.3474765\">Screen2Words<\/a> uses view hierarchy and the screenshot as well as auxiliary features (e.g., app description) to generate a summary for the screen. In the same vein, <a href=\"https:\/\/arxiv.org\/abs\/2112.05692\">VUT<\/a> combines screenshots and view hierarchies for performing multiple tasks. Finally, the original <a href=\"https:\/\/arxiv.org\/abs\/1902.11247\">Tappability<\/a> model leverages object metadata from view hierarchy and the screenshot to predict object tappability. <a href=\"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3491102.3517497\">Taperception<\/a>, a follow-up to <a href=\"https:\/\/arxiv.org\/abs\/1902.11247\">Tappability<\/a>, uses a vision-only tappability prediction approach. We examine two Spotlight model variants that differ in the size of their ViT building block: <a href=\"https:\/\/arxiv.org\/pdf\/2010.11929.pdf\">B\/16 and L\/16<\/a>. 
Spotlight drastically exceeded the state-of-the-art across four UI modeling tasks.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td>\n   <\/td>\n<td><b>Model<\/b> <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Captioning<\/b> <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Summarization<\/b><\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Grounding<\/b><\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Tappability<\/b><\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td rowspan=\"4\">Baselines\u00a0\u00a0\u00a0<br \/>\n   <\/td>\n<td>Widget Caption<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">97\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>Screen2Words<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">61.3\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>VUT<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">99.3\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">65.6\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: 
center;\">82.1\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>Taperception<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">85.5\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>\n   <\/td>\n<td>Tappability<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">87.9\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr><\/p>\n<td rowspan=\"2\">Spotlight\u00a0\u00a0\u00a0\n   <\/td>\n<p><\/p>\n<td>B\/16<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\">136.6\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">103.5\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">95.7\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">86.9\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>L\/16<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0\n   <\/td>\n<td style=\"text-align: center;\"><b>141.8<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>106.7<\/b>\n   
<\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>95.8<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>88.4<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nWe then pursue a more challenging setup where we ask the model to learn multiple tasks simultaneously because a multi-task model can substantially reduce <a href=\"https:\/\/ai.googleblog.com\/2022\/02\/good-news-about-carbon-footprint-of.html\">model footprint<\/a>. As shown in the table below, the experiments showed that our model still performs competitively.\n<\/p>\n<p><\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td><b>Model<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Captioning<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Summarization<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Grounding<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td><b>Tappability<\/b>\n   <\/td>\n<\/tr>\n<tr>\n<td>VUT multi-task\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">99.3\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">65.1\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">80.8\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">&#8211;\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>Spotlight B\/16\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">140\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>102.7<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: 
center;\">90.8\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">89.4\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<tr>\n<td>Spotlight L\/16\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>141.3<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\">99.2\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>94.2<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<td style=\"text-align: center;\"><b>89.5<\/b>\n   <\/td>\n<td>\u00a0\u00a0<\/td>\n<td>\u00a0\u00a0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nTo understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the <a href=\"https:\/\/arxiv.org\/pdf\/1409.0473.pdf\">attention weights<\/a> (which indicate where the model attention is on the screenshot) for both widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts \u201cselect Chelsea team\u201d for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heatmap (which illustrates the distribution of attention weights) on the right that the model learns to attend to not only the target region of the check box, but also the text \u201cChelsea&#8221; on the far left to generate the caption. For the screen summarization example, the model predicts \u201cpage displaying the tutorial of a learning app\u201d given the screenshot on the left. 
In this example, the target region is the entire screen, and the model learns to attend to important parts on the screen for summarization.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEj_r1cpaXEO2qjoG8MXFxScLhbQbQXPPfOktzNSuxb9S-2Bp-DafgARZXosnlEFR9yhkaQ8ciOxvw_-5lYjreMPBFRe2ni3wXHNK9pInnwnzderDAqiYU1ECe67HSKhbuQT7oa42puMH1D-jQc-JSSBQLY59VetohPfkhnclfotzdZUCpbsJub2d1F7-A\/s1494\/image2.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"1310\" data-original-width=\"1494\" height=\"561\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEj_r1cpaXEO2qjoG8MXFxScLhbQbQXPPfOktzNSuxb9S-2Bp-DafgARZXosnlEFR9yhkaQ8ciOxvw_-5lYjreMPBFRe2ni3wXHNK9pInnwnzderDAqiYU1ECe67HSKhbuQT7oa42puMH1D-jQc-JSSBQLY59VetohPfkhnclfotzdZUCpbsJub2d1F7-A\/w640-h561\/image2.png\" width=\"640\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">For the widget captioning task, the attention heatmap shows the model attending to the checkbox, i.e., the target object, and the text label on its left when generating a caption for the object. 
The red bounding box in the figure is for illustration purposes.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEg6Kb1r05IMZFz-JeyY32XxCuvbeWqBijy3uLjzMzvlI5VP27HaQCWp4mD3fETaWT5GXBO-vN0SiD71liRHX9OtexDRBGZVPblObBNFluOJEPUlP3o8EvPaS40TwOhWBklbH2wc4kLyR705t05hVOWr3Tixi5NE_flPJYcs1N7doxsEpPS9X7OnxrVInQ\/s1492\/image1.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"1312\" data-original-width=\"1492\" height=\"563\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEg6Kb1r05IMZFz-JeyY32XxCuvbeWqBijy3uLjzMzvlI5VP27HaQCWp4mD3fETaWT5GXBO-vN0SiD71liRHX9OtexDRBGZVPblObBNFluOJEPUlP3o8EvPaS40TwOhWBklbH2wc4kLyR705t05hVOWr3Tixi5NE_flPJYcs1N7doxsEpPS9X7OnxrVInQ\/w640-h563\/image1.png\" width=\"640\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">For the screen summarization task, where the target region encloses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Conclusion<\/h2>\n<p>\nWe demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as the input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility and automation to interaction design and evaluation. Our vision-only approach for mobile UI understanding removes the need for view hierarchies, allows the architecture to scale easily, and benefits from the success of large vision-language models pre-trained for the general domain. 
Compared to recent large vision-language model efforts such as <a href=\"https:\/\/www.deepmind.com\/blog\/tackling-multiple-tasks-with-a-single-visual-language-model\">Flamingo<\/a> and <a href=\"https:\/\/ai.googleblog.com\/2022\/09\/pali-scaling-language-image-learning-in.html\">PaLI<\/a>, Spotlight is relatively small, and our experiments show a trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and can potentially advance many interaction and user experience tasks.\n<\/p>\n<h2>Acknowledgment<\/h2>\n<p>\n<i>We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback and for proofreading the paper. Thanks to Tom Small for his help in creating animated figures in this post.<\/i>\n<\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/02\/a-vision-language-approach-for.html\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posted by Yang Li, Research Scientist, and Gang Li, Software Engineer, Google Research The computational understanding 
of<\/p>\n","protected":false},"author":2,"featured_media":365,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-364","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"featured_image_urls":{"full":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"thumbnail":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero-150x150.jpeg",150,150,true],"medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero-300x300.jpeg",300,300,true],"medium_large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"1536x1536":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"2048x2048":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"broadnews-featured":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"broadnews-large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero.jpeg",540,540,false],"broadnews-medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/02\/Spotlight-hero-540x410.jpeg",540,410,true]},"author_info":{"info":["Sanna"]},"category_info":"<a href=\"https:\/\/todaysainews.com\/index.php\/category\/google-ai\/\" rel=\"category tag\">Google AI<\/a>","tag_info":"Google 
AI","comment_count":"0","_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/364","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=364"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/364\/revisions"}],"predecessor-version":[{"id":2890,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/364\/revisions\/2890"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/365"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=364"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=364"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=364"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}