{"id":692,"date":"2023-08-28T21:26:33","date_gmt":"2023-08-28T21:26:33","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/08\/28\/region-aware-pre-training-for-open-vocabulary-object-detection-with-vision-transformers-google-research-blog\/"},"modified":"2025-04-27T07:33:00","modified_gmt":"2025-04-27T07:33:00","slug":"region-aware-pre-training-for-open-vocabulary-object-detection-with-vision-transformers-google-research-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/08\/28\/region-aware-pre-training-for-open-vocabulary-object-detection-with-vision-transformers-google-research-blog\/","title":{"rendered":"Region-aware pre-training for open-vocabulary object detection with vision transformers \u2013 Google Research Blog"},"content":{"rendered":"<div id=\"post-body-781660063956467509\">\n<span class=\"byline-author\">Posted by Dahun Kim and Weicheng Kuo, Research Scientists, Google<\/span><br \/>\n<img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEj9wOjJcc9-JUy0J6NEo8aRgBIeiHRY6YdneL3pBlAF4GszMf6MctGLuZG5ZClFHqMGK9j_RpgF-M2AvcScwa98FwLHtEt1rC7HCiSPhnNpG0podsHDn8uKlh9fVuIj5xYGUFytZWHkE4pANrDnXLknL-7_FTTEYVtL2MVR-DMwREMdxi3TeGZKw1OcLiPI\/s1100\/RO-ViT-hero.jpg\" style=\"display: none;\"\/><\/p>\n<p>\nThe ability to detect objects in the visual world is crucial for computer vision and machine intelligence, enabling applications like adaptive autonomous agents and versatile shopping systems. However, modern object detectors are limited by the manual annotations of their training data, resulting in a vocabulary size significantly smaller than the vast array of objects encountered in reality. 
To overcome this, the <a href=\"https:\/\/arxiv.org\/abs\/2104.13921\">open-vocabulary detection task<\/a> (OVD) has emerged, utilizing image-text pairs for training and incorporating new category names at test time by associating them with the image content. By treating categories as text embeddings, open-vocabulary detectors can predict a wide range of unseen objects. Various techniques such as <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\">image-text pre-training<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2104.13921\">knowledge distillation<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2112.09106\">pseudo labeling<\/a>, and frozen models, often employing <a href=\"https:\/\/en.wikipedia.org\/wiki\/Convolutional_neural_network\">convolutional neural network<\/a> (CNN) backbones, have been proposed. With the growing popularity of <a href=\"https:\/\/ai.googleblog.com\/2020\/12\/transformers-for-image-recognition-at.html\">vision transformers<\/a> (ViTs), it is important to explore their potential for building proficient open-vocabulary detectors.\n<\/p>\n<p><a name=\"more\"\/><\/p>\n<p>\nThe existing approaches assume the availability of pre-trained <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\">vision-language models<\/a> (VLMs) and focus on fine-tuning or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Knowledge_distillation\">distillation<\/a> from these models to address the disparity between image-level pre-training and object-level fine-tuning. However, as VLMs are primarily designed for image-level tasks like classification and retrieval, they do not fully leverage the concept of objects or regions during the pre-training phase. 
Thus, it could be beneficial for open-vocabulary detection if we build locality information into the image-text pre-training.\n<\/p>\n<p>\nIn \u201c<a href=\"https:\/\/arxiv.org\/abs\/2305.07011\">RO-ViT: Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers<\/a>\u201d, presented at <a href=\"https:\/\/cvpr2023.thecvf.com\/\">CVPR 2023<\/a>, we introduce a simple method to pre-train vision transformers in a region-aware manner to improve open-vocabulary detection. In vision transformers, positional embeddings are added to image patches to encode information about the spatial position of each patch within the image. Standard pre-training typically uses full-image positional embeddings, which does not generalize well to detection tasks. Thus, we propose a new positional embedding scheme, called \u201ccropped positional embedding\u201d, that better aligns with the use of region crops in detection fine-tuning. In addition, we replace the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Softmax_function#Neural_networks\">softmax cross entropy loss<\/a> with <a href=\"https:\/\/arxiv.org\/abs\/1708.02002\">focal loss<\/a> in contrastive image-text learning, allowing us to learn from more challenging and informative examples. Finally, we leverage recent advances in novel object proposals to enhance open-vocabulary detection fine-tuning, which is motivated by the observation that existing methods often miss novel objects during the proposal stage due to overfitting to foreground categories. We are also releasing the code <a href=\"https:\/\/github.com\/google-research\/google-research\/tree\/master\/fvlm\/rovit\">here<\/a>.\n<\/p>\n<p> <\/p>\n<h2>Region-aware image-text pre-training<\/h2>\n<p>\nExisting VLMs are trained to match an image as a whole to a text description. However, we observe there is a mismatch between the way the positional embeddings are used in the existing contrastive pre-training approaches and open-vocabulary detection. 
Positional embeddings are important to transformers because they indicate where each element in the set comes from. This information is often useful for downstream recognition and localization tasks. Pre-training approaches typically apply full-image positional embeddings during training, and use the same positional embeddings for downstream tasks, e.g., zero-shot recognition. However, in open-vocabulary detection fine-tuning the recognition occurs at the region level, which requires the full-image positional embeddings to generalize to regions that they never see during pre-training.\n<\/p>\n<p>\nTo address this, we propose <em>cropped positional embeddings<\/em> (CPE). With CPE, we upsample positional embeddings from the image size typical for pre-training, e.g., 224&#215;224 pixels, to that typical for detection tasks, e.g., 1024&#215;1024 pixels. Then we randomly crop and resize a region, and use it as the image-level positional embeddings during pre-training. The position, scale, and aspect ratio of the crop are randomly sampled. Intuitively, this causes the model to view an image not as a full image in itself, but as a region crop from some larger unknown image. 
This better matches the downstream use case of detection, where recognition occurs at the region level rather than the image level.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjh720RAiQLlaXc5jlv5hPp7qnNXPXyEV8bUc6GNKJd9dckzmgvusKgIggqPP8NXvvbUb55TzSDP-gAdhl4gIq0CxXZqTJq4heQruWCeaQsM5uY2LKiO91-n7fPkUtnW2cYg3YP4iKC520pLXWNNG-iDQk80xsadhW-qTytYg44DWYu379-BaDczwm_vGoE\/s952\/RO-ViT-img6.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"594\" data-original-width=\"952\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjh720RAiQLlaXc5jlv5hPp7qnNXPXyEV8bUc6GNKJd9dckzmgvusKgIggqPP8NXvvbUb55TzSDP-gAdhl4gIq0CxXZqTJq4heQruWCeaQsM5uY2LKiO91-n7fPkUtnW2cYg3YP4iKC520pLXWNNG-iDQk80xsadhW-qTytYg44DWYu379-BaDczwm_vGoE\/s16000\/RO-ViT-img6.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">For the pre-training, we propose\u00a0<em>cropped positional embedding<\/em>\u00a0(CPE), which randomly crops and resizes a region of positional embeddings instead of using the whole-image positional embedding (PE). In addition, we use focal loss instead of the common softmax cross entropy loss for contrastive learning.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nWe also find it beneficial to learn from hard examples with a focal loss. Focal loss enables finer control over how hard examples are weighted than the softmax cross entropy loss can provide. We therefore replace the softmax cross entropy loss with the focal loss in both the image-to-text and text-to-image losses. 
Both CPE and focal loss introduce no extra parameters and minimal computation costs.\n<\/p>\n<p> <\/p>\n<h2>Open-vocabulary detector fine-tuning<\/h2>\n<p>\nAn open-vocabulary detector is trained with the detection labels of \u2018base\u2019 categories, but needs to detect the union of \u2018base\u2019 and \u2018novel\u2019 (unlabeled) categories at test time. Despite the backbone features pre-trained from the vast open-vocabulary data, the added detector layers (neck and heads) are newly trained with the downstream detection dataset. Existing approaches often miss novel\/unlabeled objects in the object proposal stage because the proposals tend to classify them as background. To remedy this, we leverage recent advances in a novel object proposal method and adopt the localization quality-based objectness (i.e., <a href=\"https:\/\/arxiv.org\/abs\/2108.06753\">centerness<\/a> score) instead of object-or-not binary classification score, which is combined with the detection score. During training, we compute the detection scores for each detected region as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cosine_similarity\">cosine similarity<\/a> between the region\u2019s embedding (computed via <a href=\"https:\/\/en.wikipedia.org\/wiki\/Region_Based_Convolutional_Neural_Networks\">RoI-Align<\/a> operation) and the text embeddings of the base categories. 
At test time, we append the text embeddings of novel categories, and the detection score is now computed with the union of the base and novel categories.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhpulnBZgXTo37R7jq14m8vdkKd0xQIeSAqBF5mleY1EAMd1Uu1IY-Sf8cMhTlp-iIR56umGVirxfwtpNFeEpaCJw7MCZE0IsOhdkt8ny6ipDfFi4BxNOgYdmrP4Vsw874IF5US7uDBrOSwP7J2yYemDmJqp4q8E4TXnLoT0r0xZpAu2dI-YBskOGZKPvNo\/s682\/RO-ViT-img4.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"592\" data-original-width=\"682\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhpulnBZgXTo37R7jq14m8vdkKd0xQIeSAqBF5mleY1EAMd1Uu1IY-Sf8cMhTlp-iIR56umGVirxfwtpNFeEpaCJw7MCZE0IsOhdkt8ny6ipDfFi4BxNOgYdmrP4Vsw874IF5US7uDBrOSwP7J2yYemDmJqp4q8E4TXnLoT0r0xZpAu2dI-YBskOGZKPvNo\/s16000\/RO-ViT-img4.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">The pre-trained ViT backbone is transferred to the downstream open-vocabulary detection by replacing the global average pooling with detector heads. The\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Region_Based_Convolutional_Neural_Networks#:~:text=new%20method%20called-,ROIAlign,-%2C%20which%20can%20represent\" style=\"text-align: left;\">RoI-Align<\/a>\u00a0embeddings are matched with the cached category embeddings to obtain the VLM score, which is combined with the detection score into the open-vocabulary detection score.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Results<\/h2>\n<p>\nWe evaluate RO-ViT on the <a href=\"https:\/\/arxiv.org\/abs\/1908.03195\">LVIS<\/a> open-vocabulary detection benchmark. 
At the system-level, our best model achieves 33.6 box\u00a0<a href=\"https:\/\/blog.paperspace.com\/mean-average-precision\/\">average precision<\/a>\u00a0on rare categories (<a href=\"https:\/\/arxiv.org\/abs\/1908.03195\">AP<sub>r<\/sub><\/a>) and 32.1 <a href=\"https:\/\/cocodataset.org\/#home\">mask AP<sub>r<\/sub><\/a>, which outperforms\u00a0the best existing ViT-based approach OWL-ViT by 8.0 AP<sub>r<\/sub>\u00a0and the best CNN-based approach ViLD-Ens by 5.8 mask AP<sub>r<\/sub>. It also exceeds the performance of many other approaches based on knowledge distillation, pre-training, or joint training with weak supervision.<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhJwMkd6yjqUGQQwd3AvfjVXHMO1Fqm1nJ10dCRe1YwArbbOihweKhq0ITBAhdDliZvaRxlYel-0Y3ovMWmY_Mq-BEQPNDs5PwmvwugHGGwTp7jQHrll2CF-MIRibhJ9u4CTihtnhb_HS4Gn01prYhJgAkV7YYFCeuEec_N9EIv7X3vM_STQv9w52zmcAcM\/s1200\/image2.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"426\" data-original-width=\"1200\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhJwMkd6yjqUGQQwd3AvfjVXHMO1Fqm1nJ10dCRe1YwArbbOihweKhq0ITBAhdDliZvaRxlYel-0Y3ovMWmY_Mq-BEQPNDs5PwmvwugHGGwTp7jQHrll2CF-MIRibhJ9u4CTihtnhb_HS4Gn01prYhJgAkV7YYFCeuEec_N9EIv7X3vM_STQv9w52zmcAcM\/s16000\/image2.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">RO-ViT outperforms both the state-of-the-art (SOTA)\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2205.06230\" style=\"text-align: left;\">ViT-based<\/a>\u00a0and\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2104.13921\" style=\"text-align: left;\">CNN-based<\/a>\u00a0methods on LVIS open-vocabulary detection benchmark. 
We show mask AP on rare categories (AP<sub>r<\/sub>), except for the SOTA ViT-based method (OWL-ViT), where we show box AP.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nApart from evaluating region-level representation through open-vocabulary detection, we evaluate the image-level representation of RO-ViT in image-text retrieval through the <a href=\"https:\/\/arxiv.org\/abs\/1504.00325\">MS COCO<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/1505.04870\">Flickr30K<\/a> benchmarks. Our model with 303M ViT outperforms the state-of-the-art CoCa model with 1B ViT on MS COCO, and is on par on Flickr30K. This shows that our pre-training method improves not only the region-level representation but also the global image-level representation for retrieval.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgvv3eObc3ZSKGIygKA7leAfFIrVUx4EXmvx3O17d4TA1Gf1qLAOPRBQ0euAWH6mZAkmTmZBXVXy6s2NqESWDlbGpeRKVrwp3_wXzf8KGqaY50rK3fLcWtP-rfi1gDRiWkhuVDKVr1QqOGumfyJV-GrK5yI7XH0xFlrWXZISsQzAYpBK9wTdP0JX4b36s0o\/s1692\/image5.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"940\" data-original-width=\"1692\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgvv3eObc3ZSKGIygKA7leAfFIrVUx4EXmvx3O17d4TA1Gf1qLAOPRBQ0euAWH6mZAkmTmZBXVXy6s2NqESWDlbGpeRKVrwp3_wXzf8KGqaY50rK3fLcWtP-rfi1gDRiWkhuVDKVr1QqOGumfyJV-GrK5yI7XH0xFlrWXZISsQzAYpBK9wTdP0JX4b36s0o\/s16000\/image5.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">We show zero-shot image-text retrieval on MS COCO and Flickr30K benchmarks, and compare with dual-encoder methods. We report recall@1 (top-1 recall) on image-to-text (I2T) and text-to-image (T2I) retrieval tasks. 
RO-ViT outperforms the state-of-the-art\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2205.01917\" style=\"text-align: left;\">CoCa<\/a>\u00a0with the same backbone.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhh_nm9hEfD99tDp8Jtzx7DVwdsylNNAdBFGx-JMmj8EJGf1JeNd3BboFEPmf0lxbHCizm_vTGqlCYlal0PZYV2rJiFfI7jUZdtKIYBg3Dr9uUJA_UvRhQ9M9HBzT-03RPv3ZhGm7rj86AxOYqQH1KWYHkUqHyMPU_lx9wGBWX-0nYS_ICu9hRlGYPCBxjk\/s1476\/image3.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"362\" data-original-width=\"1476\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhh_nm9hEfD99tDp8Jtzx7DVwdsylNNAdBFGx-JMmj8EJGf1JeNd3BboFEPmf0lxbHCizm_vTGqlCYlal0PZYV2rJiFfI7jUZdtKIYBg3Dr9uUJA_UvRhQ9M9HBzT-03RPv3ZhGm7rj86AxOYqQH1KWYHkUqHyMPU_lx9wGBWX-0nYS_ICu9hRlGYPCBxjk\/s16000\/image3.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">RO-ViT open-vocabulary detection on LVIS. We only show the novel categories for clarity. RO-ViT detects many novel categories that it has never seen during detection training: \u201cfishbowl\u201d, \u201csombrero\u201d, \u201cpersimmon\u201d, \u201cgargoyle\u201d.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Visualization of positional embeddings<\/h2>\n<p>\nWe visualize and compare the learned positional embeddings of RO-ViT with the baseline. Each tile is the cosine similarity between positional embeddings of one patch and all other patches. For example, the tile in the top-left corner (marked in red) visualizes the similarity between the positional embedding of the location (row=1, column=1) and those positional embeddings of all other locations in 2D. 
The brightness of the patch indicates how close the learned positional embeddings of different locations are. RO-ViT forms more distinct clusters at different patch locations showing symmetrical global patterns around the center patch.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhqfeYO7FKWB_ZyoOpvwArOkfRn77jYWBJ7X1_wVdjZQRVdeXJKDTtaGwBpR6XbvK7L6_OgcDWEwESYAbXMevRwStwANdQkmi6f2w62aXNhkEu3daexyXV5F9wNYvZubp1hQE9oh3P602suQr4KOKcRT1f5Jr8yluYSe2QfA9h6uLSZEWB4hiwu24z6kttD\/s1686\/RO-ViT-img1.jpg\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"648\" data-original-width=\"1686\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhqfeYO7FKWB_ZyoOpvwArOkfRn77jYWBJ7X1_wVdjZQRVdeXJKDTtaGwBpR6XbvK7L6_OgcDWEwESYAbXMevRwStwANdQkmi6f2w62aXNhkEu3daexyXV5F9wNYvZubp1hQE9oh3P602suQr4KOKcRT1f5Jr8yluYSe2QfA9h6uLSZEWB4hiwu24z6kttD\/s16000\/RO-ViT-img1.jpg\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Each tile shows the cosine similarity between the positional embedding of the patch (at the indicated row-column position) and the positional embeddings of all other patches. ViT-B\/16 backbone is used.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Conclusion<\/h2>\n<p>\nWe present RO-ViT, a contrastive image-text pre-training framework to bridge the gap between image-level pre-training and open-vocabulary detection fine-tuning. Our methods are simple, scalable, and easy to apply to any contrastive backbones with minimal computation overhead and no increase in parameters. 
RO-ViT achieves state-of-the-art results on the LVIS open-vocabulary detection benchmark and on image-text retrieval benchmarks, showing that the learned representation is not only beneficial at the region level but also highly effective at the image level. We hope this study can help the research on open-vocabulary detection from the perspective of image-text pre-training, which can benefit both region-level and image-level tasks.\n<\/p>\n<p> <\/p>\n<h2>Acknowledgements<\/h2>\n<p>\n<em>Dahun Kim, Anelia Angelova, and Weicheng Kuo conducted this work and are now at Google DeepMind. We would like to thank our colleagues at Google Research for their advice and helpful discussions. <\/em>\n<\/p>\n<\/div>\n<p><a href=\"http:\/\/ai.googleblog.com\/2023\/08\/ro-vit-region-aware-pre-training-for.html\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Posted by Dahun Kim and Weicheng Kuo, Research Scientists, Google The ability to detect objects in the<\/p>\n","protected":false},"author":2,"featured_media":693,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-692","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"featured_image_urls":{"full":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero.jpg",1100,630,false],"thumbnail":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-150x150.jpg",150,150,true],"medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-300x172.jpg",300,172,true],"medium_large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-768x440.jpg",640,367,true],"large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-1024x586.jpg",640,366,true],"1536x1536":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero.jpg",1100,6
30,false],"2048x2048":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero.jpg",1100,630,false],"broadnews-featured":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-1024x586.jpg",1024,586,true],"broadnews-large":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-825x575.jpg",825,575,true],"broadnews-medium":["https:\/\/todaysainews.com\/wp-content\/uploads\/2023\/08\/RO-ViT-hero-590x410.jpg",590,410,true]},"author_info":{"info":["Sanna"]},"category_info":"<a href=\"https:\/\/todaysainews.com\/index.php\/category\/google-ai\/\" rel=\"category tag\">Google AI<\/a>","tag_info":"Google AI","comment_count":"0","_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/692","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=692"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/692\/revisions"}],"predecessor-version":[{"id":2728,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/692\/revisions\/2728"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/693"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=692"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=692"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=692"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templat
ed":true}]}}