Automatic Creative Selection with Cross-Modal Matching

2405.00029

Published 5/2/2024 by Alex Kim, Jia Huang, Rob Monarch, Jerry Kwac, Anikesh Kamath, Parmeshwar Khurd, Kailash Thiyagarajan, Goodman Gu

cs.CV cs.IR

Abstract

Application developers advertise their Apps by creating product pages with App images and bidding on search terms. It is therefore crucial for App images to be highly relevant to the search terms. Solutions to this problem require an image-text matching model to predict the quality of the match between the chosen image and the search terms. In this work, we present a novel approach to matching an App image to search terms based on fine-tuning a pre-trained LXMERT model. We show that, compared to the CLIP model and a baseline using a Transformer model for search terms and a ResNet model for images, we significantly improve the matching accuracy. We evaluate our approach using two sets of labels: advertiser-associated (image, search term) pairs for a given application, and human ratings of the relevance between (image, search term) pairs. Our approach achieves 0.96 AUC score for advertiser-associated ground truth, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. For human-labeled ground truth, our approach achieves 0.95 AUC score, outperforming the transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.

Overview

Developers promote their apps by creating product pages with app images and bidding on search terms
Matching app images to search terms is crucial for effective app marketing
This work presents a novel approach to image-text matching using a fine-tuned LXMERT model
The approach outperforms a CLIP model and a baseline using a Transformer for search terms and a ResNet for images

Plain English Explanation

When developers want to promote their mobile apps, they often create product pages with images of the app and bid on specific search terms that potential users might use to find apps like theirs. The key is to make sure the app images are highly relevant to the search terms, so that users see results that are a good match for what they're looking for.

To solve this problem, the researchers in this study developed a new way to predict how well an app image matches a given search term. They used a pre-trained LXMERT model and fine-tuned it for this specific task. Compared to other approaches, like using a CLIP model or a combination of a Transformer model for search terms and a ResNet model for images, their method significantly improved the accuracy of predicting how well an image matches a search term.

The researchers evaluated their approach using two different sets of labels: one provided by the app developers themselves, and one based on ratings from human judges. In both cases, their fine-tuned LXMERT model outperformed the other methods, achieving very high accuracy scores.

Technical Explanation

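The researchers in this study developed a novel approach to matching app images to search terms, based on fine-tuning a pre-trained LXMERT model. LXMERT (Learning Cross-Modality Encoder Representations from Transformers) is a pre-trained model that encodes language and visual inputs separately and then joins them in a cross-modality encoder, so it attends over image and text together. That makes it a natural fit for this image-text matching task.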

The researchers fine-tuned the LXMERT model using a dataset of (image, search term) pairs provided by app developers. This allowed the model to learn the specific patterns and relationships between app images and the search terms used to promote them.
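As a rough illustration of what such training data could look like, the sketch below turns per-app images and search terms into labelled (image, search term) examples. The summary does not say how negative examples were obtained, so pairing an image with a term from a different app is purely an assumption here, and the function name, dictionaries, and file names are hypothetical.

```python
import random


def build_pairs(app_to_images, app_to_terms, negatives_per_positive=1, seed=0):
    """Return (image, term, label) triples for a matching classifier.

    Positives pair each image of an app with each of that app's search terms.
    Negatives pair the same image with a term taken from a different app.
    This negative-sampling scheme is an assumption, not the paper's method.
    """
    rng = random.Random(seed)
    examples = []
    for app, images in app_to_images.items():
        own_terms = set(app_to_terms.get(app, []))
        # Candidate negatives: terms used by other apps but not by this one.
        other_terms = sorted(
            {t for a, ts in app_to_terms.items() if a != app for t in ts} - own_terms
        )
        for image in images:
            for term in sorted(own_terms):
                examples.append((image, term, 1))
                if not other_terms:
                    continue
                for _ in range(negatives_per_positive):
                    examples.append((image, rng.choice(other_terms), 0))
    return examples


# Hypothetical toy data; real inputs would be image files and advertiser search terms.
app_to_images = {
    "fitness_app": ["fit_1.png"],
    "recipe_app": ["cook_1.png", "cook_2.png"],
}
app_to_terms = {
    "fitness_app": ["workout tracker"],
    "recipe_app": ["easy recipes", "meal planner"],
}
pairs = build_pairs(app_to_images, app_to_terms)
```

A binary label per pair is what lets a matching model be trained as a classifier and later scored with AUC.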

To evaluate the performance of their approach, the researchers compared it to two other methods: a fine-tuned CLIP model, and a baseline approach that used a Transformer model for the search terms and a ResNet model for the images.
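For contrast with LXMERT's joint encoding, a CLIP-style model embeds the image and the text independently and scores the pair by the similarity of the two vectors. The sketch below shows that scoring step on made-up three-dimensional embeddings. The vectors and names are hypothetical, and whether the Transformer+ResNet baseline combines its two encoders this way is not stated in the summary.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_images_for_term(term_embedding, image_embeddings):
    """Sort candidate images from best to worst match for one search term."""
    scored = [
        (name, cosine_similarity(term_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up embeddings standing in for the outputs of a text encoder and an image encoder.
term_embedding = [1.0, 0.0, 1.0]
image_embeddings = {
    "screenshot_a.png": [0.9, 0.1, 0.8],
    "screenshot_b.png": [0.0, 1.0, 0.0],
}
ranking = rank_images_for_term(term_embedding, image_embeddings)
best_image = ranking[0][0]
```

Picking the top-ranked image for a search term is one way the match score could drive creative selection. A cross-modal model would replace the similarity function with a single network that sees both inputs at once.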

The fine-tuned LXMERT model significantly outperformed both alternatives. Against the developer-provided ground truth, it achieved an AUC (area under the curve) score of 0.96, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. Against the human-labeled ground truth, it achieved an AUC score of 0.95, outperforming the baseline and the CLIP model by 16% and 17%, respectively.
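To make the AUC metric concrete, here is a minimal sketch that assumes the reported figure is the area under the ROC curve. Under that assumption, AUC equals the probability that a randomly chosen relevant pair receives a higher score than a randomly chosen irrelevant pair, with ties counted as half. The labels and scores are invented for illustration and are not the paper's data.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # relevant pair ranked above the irrelevant one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Invented relevance labels (1 = relevant) and model scores for six (image, term) pairs.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative comparisons are won
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch. Library implementations typically use a sort-based method instead.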

Critical Analysis

The researchers provide a comprehensive evaluation of their approach, including comparisons to other state-of-the-art methods. However, there are a few potential limitations and areas for further research that could be considered:

The dataset used for fine-tuning and evaluation was provided by app developers, which may introduce some bias or noise. It would be interesting to see how the model performs on a more curated, third-party dataset of app images and search terms.
The researchers only evaluated the model's performance on a relatively narrow task of matching app images to search terms. It would be valuable to explore how the fine-tuned LXMERT model might generalize to other image-text matching tasks, such as zero-shot concept generation or cross-modal neighbor representation.
While the LXMERT model outperformed the other approaches, it would be interesting to see if combining the strengths of different models, as in the Transformer+ResNet baseline, could lead to even better performance.

Overall, the researchers have presented a promising approach to image-text matching that could have significant practical applications for app developers and digital marketers.

Conclusion

This study introduces a novel method for matching app images to search terms, based on fine-tuning a pre-trained LXMERT model. The approach significantly outperforms both the Transformer+ResNet baseline and a fine-tuned CLIP model, achieving AUC scores of 0.96 on developer-provided ground truth and 0.95 on human-labeled ground truth.

The researchers' work demonstrates the power of fine-tuning large, pre-trained models for specific tasks, and highlights the importance of image-text matching for effective app marketing. Their findings could have practical applications for app developers, digital marketers, and other industries where matching visual and textual content is crucial.

While the study has some limitations, it opens up interesting avenues for further research, such as exploring the model's performance on other image-text matching tasks and investigating ways to combine the strengths of different approaches. Overall, this work represents an important step forward in the field of cross-modal understanding and its real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Advanced Multimodal Deep Learning Architecture for Image-Text Matching

Jinyin Wang, Haijing Zhang, Yihao Zhong, Yingbin Liang, Rongwei Ji, Yiru Cang

Image-text matching is a key multimodal task that aims to model the semantic association between images and text as a matching relationship. With the advent of the multimedia information age, image and text data are growing explosively, and how to establish accurate and efficient semantic correspondence between them has become a core concern in academia and industry. In this study, we delve into the limitations of current multimodal deep learning models in processing image-text pairing tasks. To address them, we design an advanced multimodal deep learning architecture, which combines the high-level abstract representation ability of deep neural networks for visual information with the advantages of natural language processing models for text semantic understanding. By introducing a novel cross-modal attention mechanism and hierarchical feature fusion strategy, the model achieves deep fusion and two-way interaction between image and text feature space. We also optimize the training objectives and loss functions to ensure that the model can better map the potential association structure between images and text during the learning process. Experiments show that, compared with existing image-text matching models, the optimized new model has significantly improved performance on a series of benchmark data sets. The new model also shows excellent generalization and robustness on large and diverse open scenario datasets and can maintain high matching performance even in the face of previously unseen complex situations.

6/24/2024

cs.LG cs.CL cs.CV

CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval

Christian Lulf, Denis Mayr Lima Martins, Marcos Antonio Vaz Salles, Yongluan Zhou, Fabian Gieseke

The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework involves training a classification model given the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. By building upon recent techniques, this inference phase, however, is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times.

6/21/2024

cs.IR

Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Andreas Koukounas, Georgios Mastrapas, Michael Gunther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.

6/27/2024

cs.CL cs.AI cs.CV cs.IR

Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching

Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose

Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed Hire) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects' spatial positions and their scene graph. We use implicit relationship modelling for potential relationship interactions before explicit modelling to improve the fault tolerance of explicit relationship detection. Then the visual and textual semantic representations are refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modelling is more beneficial for image-text matching. And the proposed Hire obtains new state-of-the-art results on MS-COCO and Flickr30K benchmarks.

6/28/2024

cs.CV cs.IR