{"id":532,"date":"2023-05-05T00:15:03","date_gmt":"2023-05-05T00:15:03","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/05\/05\/a-simple-vision-encoder-text-decoder-architecture-for-multimodal-tasks-google-ai-blog\/"},"modified":"2025-04-27T07:33:42","modified_gmt":"2025-04-27T07:33:42","slug":"a-simple-vision-encoder-text-decoder-architecture-for-multimodal-tasks-google-ai-blog","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/05\/05\/a-simple-vision-encoder-text-decoder-architecture-for-multimodal-tasks-google-ai-blog\/","title":{"rendered":"A simple vision-encoder text-decoder architecture for multimodal tasks \u2013 Google AI Blog"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div id=\"post-body-7102791933497880989\">\n<span class=\"byline-author\">Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research<\/span><\/p>\n<p><img decoding=\"async\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEh53KlJZUXTHEd1ZhRav_9Hwl-MzCVTzans8VhEzushmfeKHUBfNDKTIPpVEbrDhtxlZWeBgLYsIsi6krB_GefP0SrNX-92H3eunTcCwjAH_t2KBW8wVMzZlvYbiltJM5xMFhy9Euclq7q33HgKgdvmsoXnOIbL-RkGMDeHn_ocy2puVKIqfkJ05REmuA\/s800\/MAMMUT.png\" style=\"display: none;\"\/><\/p>\n<p>\nVision-language foundational models are built on the premise of a single pre-training followed by subsequent adaptation to multiple downstream tasks. Two main and disjoint training scenarios are popular: a <a href=\"https:\/\/arxiv.org\/abs\/2103.00020\">CLIP<\/a>-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict if image-text pairs correctly match, effectively building visual and text representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next text token in a sequence, thus learning to generate text, according to the required task. <a href=\"https:\/\/arxiv.org\/abs\/2002.05709\">Contrastive learning<\/a> enables <a href=\"https:\/\/en.wikipedia.org\/wiki\/Image_retrieval\">image-text and text-image retrieval tasks<\/a>, such as finding the image that best matches a certain description, and next-token learning enables text-generative tasks, such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_generation#Image_captioning\">Image Captioning<\/a> and <a href=\"https:\/\/visualqa.org\/\">Visual Question Answering<\/a> (VQA). While both approaches have demonstrated powerful results, when a model is pre-trained contrastively, it typically does not fare well on text-generative tasks and vice-versa. Furthermore, adaptation to other tasks is often done with complex or inefficient methods. For example, in order to extend a vision-language model to videos, some models need to do inference for each video frame separately. This limits the size of the videos that can be processed to only a few frames and does not fully take advantage of motion information available across frames.\n<\/p>\n<p> <a name=\"more\"\/><\/p>\n<p>\nMotivated by this, we present \u201c<a href=\"https:\/\/arxiv.org\/abs\/2303.16839\">A Simple Architecture for Joint Learning for MultiModal Tasks<\/a>\u201d, called MaMMUT, which is able to train jointly for these competing objectives and which provides a foundation for many vision-language tasks either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model that trains across contrastive, text generative, and localization-aware objectives. It consists of a single image encoder and a text decoder, which allows for a direct reuse of both components. Furthermore, a straightforward adaptation to video-text tasks requires only using the image encoder once and can handle many more frames than prior work. In line with recent language models (e.g., <a href=\"https:\/\/ai.googleblog.com\/2022\/04\/pathways-language-model-palm-scaling-to.html\">PaLM<\/a>, <a href=\"https:\/\/ai.googleblog.com\/2021\/12\/more-efficient-in-context-learning-with.html\">GLaM<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2005.14165\">GPT3<\/a>), our architecture uses a decoder-only text model and can be thought of as a simple extension of language models. While modest in size, our model outperforms the state of the art or achieves competitive performance on <a href=\"https:\/\/en.wikipedia.org\/wiki\/Image_retrieval\">image-text and text-image retrieval<\/a>, <a href=\"https:\/\/ai.googleblog.com\/2022\/08\/efficient-video-text-learning-with.html\">video question answering<\/a> (VideoQA), <a href=\"https:\/\/ai.googleblog.com\/2022\/06\/end-to-end-generative-pre-training-for.html\">video captioning<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2104.13921\">open-vocabulary detection<\/a>, and <a href=\"https:\/\/visualqa.org\/\">VQA<\/a>.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhfXoqtrQ_BRyHf1n2D4dU-bUQZTrTBBDfLt3fSlhuQy1pAFOXANnp6WUsJlEQLdlOsOPG2y3iqJUDH0fB1lOUaxjGv3L9BoFCL4Y8vqC2Cya0iXM3sjhiYM_6rzGjSnsvfvTc8qqjPi8BIjppq2xswR3-gk4x6ysfu_FsN7G5_VCFr0kjx4ttYcn3LYA\/s1182\/image15.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"321\" data-original-width=\"1182\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhfXoqtrQ_BRyHf1n2D4dU-bUQZTrTBBDfLt3fSlhuQy1pAFOXANnp6WUsJlEQLdlOsOPG2y3iqJUDH0fB1lOUaxjGv3L9BoFCL4Y8vqC2Cya0iXM3sjhiYM_6rzGjSnsvfvTc8qqjPi8BIjppq2xswR3-gk4x6ysfu_FsN7G5_VCFr0kjx4ttYcn3LYA\/s16000\/image15.png\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiGAXjkLnL2NXPDE4zPRjFmPblDtP_vR5G9s7ox3xZFgGUIVm5YROXJpaSGb1MJR3QHajoIwslBnEeTuSPhVT9FwxqvvUVmjBXUS838J3G5jmicy7Y2DYF_u-J8x0Avv9TJuq6yFFZc0Yi3sqCrclKGbLiMEuFUJxVHL6ij8fgXFOuKR2WOaeWYxST8VA\/s980\/image8.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"314\" data-original-width=\"980\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiGAXjkLnL2NXPDE4zPRjFmPblDtP_vR5G9s7ox3xZFgGUIVm5YROXJpaSGb1MJR3QHajoIwslBnEeTuSPhVT9FwxqvvUVmjBXUS838J3G5jmicy7Y2DYF_u-J8x0Avv9TJuq6yFFZc0Yi3sqCrclKGbLiMEuFUJxVHL6ij8fgXFOuKR2WOaeWYxST8VA\/s16000\/image8.png\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiQErbPcXI1OwDK2NK155nBq67jBnBcYb5pCGJrupehsnE-WT6MpVWbMmcSZ7rwFBGsCIKQiILMzSxGmIgLMrdi9ULbHN7X_6y6Z3bpOActzoqziSIfee6L-vppDifAF3uVh6M61MtsQ-LM-gG4g5S4-k9Sga7IYVNMC5pGB8KxLRYjXDVZACAF_S4Okw\/s1120\/image10.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"390\" data-original-width=\"1120\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiQErbPcXI1OwDK2NK155nBq67jBnBcYb5pCGJrupehsnE-WT6MpVWbMmcSZ7rwFBGsCIKQiILMzSxGmIgLMrdi9ULbHN7X_6y6Z3bpOActzoqziSIfee6L-vppDifAF3uVh6M61MtsQ-LM-gG4g5S4-k9Sga7IYVNMC5pGB8KxLRYjXDVZACAF_S4Okw\/s16000\/image10.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">The MaMMUT model enables a wide range of tasks such as image-text\/text-image retrieval (<b>top left<\/b> and <b>top right<\/b>), VQA (<b>middle left<\/b>), open-vocabulary detection (<b>middle right<\/b>), and VideoQA (<b>bottom<\/b>).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Decoder-only model architecture<\/h2>\n<p>\nOne surprising finding is that a single language-decoder is sufficient for all these tasks, which obviates the need for both complex constructs and training procedures presented before. For example, our model (presented to the left in the figure below) consists of a single visual encoder and single text-decoder, connected via <a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">cross attention<\/a>, and trains simultaneously on both contrastive and text-generative types of losses. Comparatively, prior work is either not able to handle image-text retrieval tasks, or applies only some losses to only some parts of the model. To enable multimodal tasks and fully take advantage of the decoder-only model, we need to jointly train both contrastive losses and text-generative captioning-like losses.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjrJfuTxTaxiMFIo1c34KaN-CRsNQcUV4WSaxNTCV-y0CSQodJECx4k9QT0Ww6xjF1hVgJ7zioahWWiBWvMYoIiVKsDFF5p78ACe7f7j-M9DtfOwmnNznwt7pfCBmsO-jU-QiOWCbL4IiLgSXnYMa23b9LSk6Hyb25_ZVeRwPj86LSOvyQqpA1pugzFaw\/s1191\/image7.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"532\" data-original-width=\"1191\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjrJfuTxTaxiMFIo1c34KaN-CRsNQcUV4WSaxNTCV-y0CSQodJECx4k9QT0Ww6xjF1hVgJ7zioahWWiBWvMYoIiVKsDFF5p78ACe7f7j-M9DtfOwmnNznwt7pfCBmsO-jU-QiOWCbL4IiLgSXnYMa23b9LSk6Hyb25_ZVeRwPj86LSOvyQqpA1pugzFaw\/s16000\/image7.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">MaMMUT architecture (<b>left<\/b>) is a simple construct consisting of a single vision encoder and a single text decoder. Compared to other popular vision-language models \u2014 e.g., <a href=\"https:\/\/arxiv.org\/abs\/2209.06794\">PaLI<\/a> (<b>middle<\/b>) and <a href=\"https:\/\/arxiv.org\/abs\/2107.07651\">ALBEF<\/a>, <a href=\"https:\/\/arxiv.org\/abs\/2205.01917\">CoCa<\/a> (<b>right<\/b>) \u2014 it trains jointly and efficiently for multiple vision-language tasks, with both contrastive and text-generative losses, fully sharing the weights between the tasks.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Decoder two-pass learning<\/h2>\n<p>\n<a href=\"https:\/\/arxiv.org\/abs\/1801.10198\">Decoder-only models<\/a> for language learning show clear advantages in performance with smaller model size (almost half the parameters). The main challenge for applying them to multimodal settings is to unify the contrastive learning (which uses unconditional sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representations within the decoder. During the first pass, we utilize cross attention and <a href=\"https:\/\/www.tensorflow.org\/text\/tutorials\/transformer#the_causal_self_attention_layer\">causal masking<\/a> to learn the caption generation task \u2014 the text features can attend to the image features and predict the tokens in sequence. On the second pass, we disable the cross-attention and causal masking to learn the contrastive task. The text features will not see the image features but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Completing this two-pass approach within the same decoder allows for accommodating both types of tasks that were previously hard to reconcile. While simple, we show that this model architecture is able to provide a foundation for multiple multimodal tasks.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjRePkirehczoN54jZwC69K_aoj7BraBtL3rHmX2Q6FDfkwauOF-ODzu-TzhMv5c2PdTZln40s_AImgBHz01ASHMN4ZybMA0gVmq6A2GSP9L9b3uslrnE3256A6v-hp-ZEy2H6Az1HCFhlnKt6h-YjN_BwS3rGMUnq_YzrKOWPoNGL31hcPJ6vpbBnNIg\/s670\/image4.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"314\" data-original-width=\"670\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjRePkirehczoN54jZwC69K_aoj7BraBtL3rHmX2Q6FDfkwauOF-ODzu-TzhMv5c2PdTZln40s_AImgBHz01ASHMN4ZybMA0gVmq6A2GSP9L9b3uslrnE3256A6v-hp-ZEy2H6Az1HCFhlnKt6h-YjN_BwS3rGMUnq_YzrKOWPoNGL31hcPJ6vpbBnNIg\/s16000\/image4.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">MaMMUT decoder-only two-pass learning enables both contrastive and generative learning paths by the same model.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nAnother advantage of our architecture is that, since it is trained for these disjoint tasks, it can be seamlessly applied to multiple applications such as image-text and text-image retrieval, VQA, and captioning.\n<\/p>\n<p>\nMoreover, MaMMUT easily adapts to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which required applying it multiple times. This is slow and restricts the number of frames the model can handle, typically to only 6\u20138. With MaMMUT, we use <a href=\"https:\/\/arxiv.org\/abs\/2212.03229\">sparse video tubes<\/a> for lightweight adaptation directly via the spatio-temporal information from the video. Furthermore, adapting the model to <a href=\"https:\/\/arxiv.org\/abs\/2104.13921\">Open-Vocabulary Detection<\/a> is done by simply training to detect bounding-boxes via an object-detection head.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEihF47QWAtA-rAe5KLd6H5SlOjsT5SRS7GhLbujvzg0xpRTrNgsuttkJIOjFlMSoQGxKDkBL5kGVX0waDNr3-4ewW1pkZEHk8RcA39HX7w3j7lAcwOIw0OBcAiSSQRA1NERhMp2GMDpaHgkAlByZJ1uVcp-MWU4KFQJMBXt1Sr_duxGa3rN_X6GYIHh7w\/s632\/image5.png\" imageanchor=\"1\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"204\" data-original-width=\"632\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEihF47QWAtA-rAe5KLd6H5SlOjsT5SRS7GhLbujvzg0xpRTrNgsuttkJIOjFlMSoQGxKDkBL5kGVX0waDNr3-4ewW1pkZEHk8RcA39HX7w3j7lAcwOIw0OBcAiSSQRA1NERhMp2GMDpaHgkAlByZJ1uVcp-MWU4KFQJMBXt1Sr_duxGa3rN_X6GYIHh7w\/s16000\/image5.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Adaptation of the MaMMUT architecture to video tasks (<b>left<\/b>) is simple and fully reuses the model. This is done by generating a video \u201ctubes\u201d feature representation, similar to image patches, that are projected to lower dimensional tokens and run through the vision encoder. Unlike prior approaches (<b>right<\/b>) that need to run multiple individual images through the vision encoder, we use it only once.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Results<\/h2>\n<p>\nOur model achieves excellent zero-shot results on image-text and text-image retrieval without any adaptation, outperforming all previous state-of-the-art models. The results on VQA are competitive with state-of-the-art results, which are achieved by much larger models. The <a href=\"https:\/\/arxiv.org\/abs\/2209.06794\">PaLI model<\/a> (17B parameters) and the <a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/DeepMind.com\/Blog\/tackling-multiple-tasks-with-a-single-visual-language-model\/flamingo.pdf\">Flamingo model<\/a> (80B) have the best performance on the <a href=\"https:\/\/visualqa.org\/\">VQA2.0 dataset<\/a>, but MaMMUT (2B) has the same accuracy as the 15B PaLI.<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjMrJFan32k1qe9UYjl8S0wX9PyYc4myFwpjtL70v4w6FcZYkwZ_ruQd5NADBFfQrVJwO5I55Zl6_MhQSQsA2xojPH4CnBdNlWuz8zPAfxP8j54YQUC5252Hs3jN5ErpnkfiiILuXL5GBHBh4EKI9rETHhQmvXLGmN7lo5EoV_30kpHQFGnNufz3YX7yw\/s600\/image3.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjMrJFan32k1qe9UYjl8S0wX9PyYc4myFwpjtL70v4w6FcZYkwZ_ruQd5NADBFfQrVJwO5I55Zl6_MhQSQsA2xojPH4CnBdNlWuz8zPAfxP8j54YQUC5252Hs3jN5ErpnkfiiILuXL5GBHBh4EKI9rETHhQmvXLGmN7lo5EoV_30kpHQFGnNufz3YX7yw\/s16000\/image3.png\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiIZMvhSGCzClyc92aahNZzS8ru6T0cMh6jY0JHwUVUMTXzcmZUmPj3E3Yq-j1Xy0AG_VFNmE5B--CFQACoQaJmPt8-bbo6FeA9oroJgGP-PUhNJGnwq4_J_C9virlIu30c7FGhLNGpzdYc7-H5dVUSAkJE3z9BMbev_ZL4T14gYxRJSPkcYDZu9Y8Qag\/s600\/image12.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEiIZMvhSGCzClyc92aahNZzS8ru6T0cMh6jY0JHwUVUMTXzcmZUmPj3E3Yq-j1Xy0AG_VFNmE5B--CFQACoQaJmPt8-bbo6FeA9oroJgGP-PUhNJGnwq4_J_C9virlIu30c7FGhLNGpzdYc7-H5dVUSAkJE3z9BMbev_ZL4T14gYxRJSPkcYDZu9Y8Qag\/s16000\/image12.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">MaMMUT outperforms the state of the art (SOTA) on Zero-Shot Image-Text (I2T) and Text-Image (T2I) retrieval on both <a href=\"https:\/\/arxiv.org\/abs\/1504.00325\">MS-COCO<\/a> (<b>top<\/b>) and <a href=\"https:\/\/ieeexplore.ieee.org\/document\/7410660\">Flickr<\/a> (<b>bottom<\/b>) benchmarks.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEioruxeWxIKiiAjCLOb2_AT-DrSBI0El7BDO_hO2U-LLDnJXhu2N6c2lfVEj7bIp2GJ2oF2UPnuXUBQwRy_z_Kvfu_2PqShjKi1Ak3kvxRecmHHIcnZGhD6cyEnMy72PAf2izaEa_DMSBhlfjtjLQyjSptFuHtNHry6MEnL5LbYMOi9vqtat70WpN3pbA\/s600\/image2.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEioruxeWxIKiiAjCLOb2_AT-DrSBI0El7BDO_hO2U-LLDnJXhu2N6c2lfVEj7bIp2GJ2oF2UPnuXUBQwRy_z_Kvfu_2PqShjKi1Ak3kvxRecmHHIcnZGhD6cyEnMy72PAf2izaEa_DMSBhlfjtjLQyjSptFuHtNHry6MEnL5LbYMOi9vqtat70WpN3pbA\/s16000\/image2.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Performance on the <a href=\"https:\/\/visualqa.org\/\">VQA2.0 dataset<\/a> is competitive but does not outperform large models such as Flamingo-80B and PalI-17B. Performance is evaluated in the more challenging open-ended text generation setting.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nMaMMUT also outperforms the state-of-the-art on VideoQA, as shown below on the <a href=\"https:\/\/ieeexplore.ieee.org\/document\/7780940\">MSRVTT-QA<\/a> and <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/3123266.3123427\">MSVD-QA<\/a> datasets. Note that we outperform much bigger models such as <a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/DeepMind.com\/Blog\/tackling-multiple-tasks-with-a-single-visual-language-model\/flamingo.pdf\">Flamingo<\/a>, which is specifically designed for image+video pre-training and is pre-trained with both image-text and video-text data.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgQIu-j_ORVRUXKeMjJNnLbvTScNct-Fv14_jy-y2kvMbu0UcuNOm_pQTe1fqA8XblGMS4qNbbCP0wk3EwpXBwwpBmjOqIm0WznRqKPSGiAkjtoOkG3K6hHB5Sa-FpeAdkaF8_KaXdz-Hext1_SBlMFeY8TRSFMXwtl1xR6sITQ6xMm7XkMrd9XrIYhbg\/s600\/image13.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEgQIu-j_ORVRUXKeMjJNnLbvTScNct-Fv14_jy-y2kvMbu0UcuNOm_pQTe1fqA8XblGMS4qNbbCP0wk3EwpXBwwpBmjOqIm0WznRqKPSGiAkjtoOkG3K6hHB5Sa-FpeAdkaF8_KaXdz-Hext1_SBlMFeY8TRSFMXwtl1xR6sITQ6xMm7XkMrd9XrIYhbg\/s16000\/image13.png\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhwn4A39cTqXGhA-HgDPSERXNOI6hggrFSmWGtCSIbCbOnAilYKlhlDa_H4X6QuU7DiVNaya6uITKlFFod33Vlm5M631dp73u5zsZVQBu8AyPpu25JZMDnyiXmqgnr_z3KEiQ-BoUhjPcYWweM01mJzdKLs6i0-30fVkjqK2bvGg2ppkzBjaI_Tw1YPWQ\/s600\/image9.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEhwn4A39cTqXGhA-HgDPSERXNOI6hggrFSmWGtCSIbCbOnAilYKlhlDa_H4X6QuU7DiVNaya6uITKlFFod33Vlm5M631dp73u5zsZVQBu8AyPpu25JZMDnyiXmqgnr_z3KEiQ-BoUhjPcYWweM01mJzdKLs6i0-30fVkjqK2bvGg2ppkzBjaI_Tw1YPWQ\/s16000\/image9.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">MaMMUT outperforms the SOTA models on VideoQA tasks (MSRVTT-QA dataset, <b>top<\/b>, MSVD-QA dataset,<b> bottom<\/b>), outperforming much larger models, e.g., the 5B GIT2 or Flamingo, which uses 80B parameters and is pre-trained for both image-language and vision-language tasks.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\nOur results outperform the state-of-the-art on open-vocabulary detection fine-tuning as is also shown below.\n<\/p>\n<h2>Key ingredients<\/h2>\n<p>\nWe show that joint training of both contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are served better by different design choices. We see that fewer cross-attention connections are better for retrieval tasks, but more are preferred by VQA tasks. Yet, while this shows that our model\u2019s design choices might be suboptimal for individual tasks, our model is more effective than more complex, or larger, models.\n<\/p>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjBEkGSnnhxudBxkKF6j9y2VDIsHw6_TVfB2e3a0zxuI70LR7BS36oWUXoBG13tOqVMUygmi7uLHh6rnOWEALSdujHNh5iSUfOEMp2hTw0L4I-t5Jp2O8Sj0hpoYoC50asshxoaQgrdMEci-c7zPOaS9znnWy8WJqdJveWlDlN1k5UKLIJDcS0-Ipru-Q\/s600\/image11.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjBEkGSnnhxudBxkKF6j9y2VDIsHw6_TVfB2e3a0zxuI70LR7BS36oWUXoBG13tOqVMUygmi7uLHh6rnOWEALSdujHNh5iSUfOEMp2hTw0L4I-t5Jp2O8Sj0hpoYoC50asshxoaQgrdMEci-c7zPOaS9znnWy8WJqdJveWlDlN1k5UKLIJDcS0-Ipru-Q\/s16000\/image11.png\"\/><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\">\n<tbody>\n<tr>\n<td style=\"text-align: center;\"><a href=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjgraznHk4GwR-5XkHHCVzBqq-F-XC11rdag5OlGe57beIUIc_yDZsZkcmALVd7ed_x-Cy_YNBxPUuu7xj3K_rXDrmBC_jvm8VxgZBSSSZ04TRe5NkXN5fsRHAuleUXZdFe6O0JiVFpa5U24bz2xAfQqqvDqTz5IN5wanVwDYRsQ9SJGYO09U4E3bCr5w\/s600\/image6.png\" style=\"margin-left: auto; margin-right: auto;\"><img decoding=\"async\" border=\"0\" data-original-height=\"371\" data-original-width=\"600\" src=\"https:\/\/blogger.googleusercontent.com\/img\/b\/R29vZ2xl\/AVvXsEjgraznHk4GwR-5XkHHCVzBqq-F-XC11rdag5OlGe57beIUIc_yDZsZkcmALVd7ed_x-Cy_YNBxPUuu7xj3K_rXDrmBC_jvm8VxgZBSSSZ04TRe5NkXN5fsRHAuleUXZdFe6O0JiVFpa5U24bz2xAfQqqvDqTz5IN5wanVwDYRsQ9SJGYO09U4E3bCr5w\/s16000\/image6.png\"\/><\/a><\/td>\n<\/tr>\n<tr>\n<td class=\"tr-caption\" style=\"text-align: center;\">Ablation studies showing that fewer cross-attention connections (1-2) are better for retrieval tasks (<b>top<\/b>), whereas more connections favor text-generative tasks such as VQA (<b>bottom<\/b>).<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Conclusion<\/h2>\n<p>\nWe presented MaMMUT, a simple and compact vision-encoder language-decoder model that jointly trains a number of conflicting objectives to reconcile contrastive-like and text-generative tasks. Our model also serves as a foundation for many more vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, videoQA, video captioning, open-vocabulary detection and VQA. We hope it can be further used for more multimodal applications.\n<\/p>\n<h2>Acknowledgements<\/h2>\n<p>\n<em>The work described is co-authored by: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedhosseini, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zirui Wang, Yonghui Wu, Runze Li, Jie Mei, Radu Soricut, Qingqing Huang, Andy Ly, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, Zoubin Ghahramani for their help and support. <\/em>\n<\/p>\n<\/div>\n<p>[ad_2]<br \/>\n<br \/><a href=\"http:\/\/ai.googleblog.com\/2023\/05\/mammut-simple-vision-encoder-text.html\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>[ad_1] Posted by AJ Piergiovanni and Anelia Angelova, Research Scientists, Google Research Vision-language foundational models are built on<\/p>\n","protected":false},"author":2,"featured_media":533,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20],"tags":[],"class_list":["post-532","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-google-ai"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/532","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=532"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/532\/revisions"}],"predecessor-version":[{"id":2806,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/532\/revisions\/2806"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/533"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=532"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/categories?post=532"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=532"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}