COURIER: Contrastive User Intention Reconstruction for Large-Scale Visual Recommendation

Read original: arXiv:2306.05001 - Published 6/7/2024 by Jia-Qi Yang, Chenglei Dai, Dan OU, Dongshuai Li, Ju Huang, De-Chuan Zhan, Xiaoyi Zeng, Yang Yang

Overview

Online retail is heavily influenced by visual characteristics, and incorporating visual features can improve click-through rate (CTR) predictions.
Existing image feature pre-training methods have limitations for CTR prediction in recommendation systems, as they focus on cross-modal predictions rather than directly using visual features as model inputs.
The paper proposes a novel visual feature pre-training method tailored for recommendation systems, which mines visual features related to user interests from behavior histories.

Plain English Explanation

When people shop online, the visual appearance of products can significantly impact whether they choose to click on them or not. Incorporating visual features into recommendation systems could, therefore, help improve the prediction of click-through rates (CTR) - a key metric for online retailers.

However, the researchers found that simply using image embeddings from existing pre-training methods only provided marginal improvements. This is because those methods are designed more for predicting relationships between different types of content (like text and images), rather than directly using visual features to predict user behavior, as is needed for recommendation systems.

To address this, the researchers developed a new approach that specifically learns visual features related to user interests. By looking at users' behavior histories, the method identifies which visual elements are most relevant to their preferences. It then uses a contrastive learning technique to train a model that captures those visual interests while keeping the learned embedding vectors from collapsing.

The researchers tested this new method on public datasets as well as their own production system, and found it achieved statistically significant improvements in CTR prediction (measured by offline AUC) and in sales (measured by gross merchandise volume, GMV) compared to existing approaches. This suggests that tailoring visual feature learning to the specific needs of recommendation systems can be a powerful way to enhance their performance.

Technical Explanation

The paper proposes a novel visual feature pre-training method for click-through rate (CTR) prediction in recommendation systems. Existing image feature pre-training methods, such as CLIP-based techniques, focus on cross-modal prediction tasks, which differ significantly from the goal of CTR prediction in recommendation systems.

The key innovation of the proposed method is a "user intention reconstruction module" that mines visual features related to user interests from behavior histories. This creates a many-to-one correspondence between visual elements and user preferences. The researchers also introduce a contrastive training approach to learn these user intentions and prevent the collapse of embedding vectors.
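The two pieces described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `reconstruct_intention` and `contrastive_loss`, the dot-product attention, the temperature of 0.1, and the in-batch InfoNCE-style loss are all choices made here for illustration, since the summary does not specify them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruct_intention(history, target, temperature=0.1):
    """ASSUMPTION: attention-weighted sum of history image embeddings,
    using the target item's embedding as the query (many-to-one)."""
    history = [normalize(h) for h in history]
    target = normalize(target)
    weights = softmax([dot(target, h) / temperature for h in history])
    return [sum(w * h[d] for w, h in zip(weights, history))
            for d in range(len(target))]

def contrastive_loss(reconstructions, targets, temperature=0.1):
    """ASSUMPTION: InfoNCE-style loss. Row i's positive is targets[i];
    the other targets in the batch act as negatives."""
    recs = [normalize(r) for r in reconstructions]
    tgts = [normalize(t) for t in targets]
    loss = 0.0
    for i, r in enumerate(recs):
        probs = softmax([dot(r, t) / temperature for t in tgts])
        loss -= math.log(probs[i])
    return loss / len(recs)

# Toy batch: two users whose histories point in different directions.
histories = [
    [[1.0, 0.0], [0.9, 0.1]],  # user 0 clicked "x-like" items
    [[0.0, 1.0], [0.1, 0.9]],  # user 1 clicked "y-like" items
]
targets = [[1.0, 0.1], [0.1, 1.0]]

recs = [reconstruct_intention(h, t) for h, t in zip(histories, targets)]
aligned = contrastive_loss(recs, targets)         # correct pairing
shuffled = contrastive_loss(recs, targets[::-1])  # mismatched pairing
print(aligned < shuffled)
```

With in-batch negatives, the loss is low only when each reconstruction is closer to its own target than to other users' targets, which is one common way to discourage all embeddings from collapsing to a single point.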

Extensive experiments on public datasets and the researchers' own production system show that this tailored visual feature pre-training improves offline AUC by 0.46% and online Taobao GMV (gross merchandise volume) by 0.88%, with a p-value below 0.01. This suggests that designing visual feature learning specifically for the recommendation task, rather than relying on generic pre-training methods, can be a promising direction for enhancing click-through rate prediction in online retail.
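For readers unfamiliar with the offline metric, AUC can be read as the probability that a randomly chosen clicked item is scored above a randomly chosen non-clicked one. The sketch below computes it by brute-force pair counting. The function name `auc` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (clicked, non-clicked) pairs ranked
    correctly, counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical CTR predictions: 1 = clicked, 0 = not clicked.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]
result = auc(labels, scores)  # 3 of the 4 pairs are ordered correctly
print(result)
```

The quadratic pair loop is only suitable for small examples; production evaluation typically uses a sort-based computation instead.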

Critical Analysis

The paper presents a compelling approach to leveraging visual features for improved CTR prediction in recommendation systems. By focusing on learning visual representations that are directly relevant to user preferences, the proposed method appears to outperform generic image feature pre-training techniques.

However, the paper does not provide a detailed analysis of the limitations of this approach. For example, it is unclear how well the method would generalize to scenarios with sparse user behavior data, or how robust it is to noisy or ambiguous visual signals. Additionally, the paper does not address potential privacy concerns around mining user behavior data to inform visual feature learning.

Further research could explore ways to make the visual feature learning process more interpretable, allowing practitioners to understand which visual elements are most predictive of user engagement. This could lead to additional opportunities for optimizing the user experience, beyond just improving CTR prediction.

Overall, the paper makes a strong case for the value of tailoring visual feature learning to the specific needs of recommendation systems, and the results suggest this is a promising direction for future work in this area.

Conclusion

This paper presents a novel visual feature pre-training method for improving click-through rate (CTR) prediction in online retail recommendation systems. By mining visual features related to user interests from behavior histories, the proposed approach can learn representations that are more directly relevant to the downstream task, outperforming generic image feature pre-training techniques.

The experimental results demonstrate significant improvements in both offline and online performance metrics, suggesting that designing visual feature learning specifically for recommendation systems, rather than relying on generic computer vision models, can be a powerful way to enhance the user experience and drive business outcomes in e-commerce. This work highlights the importance of tailoring machine learning solutions to the unique needs of the application domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

COURIER: Contrastive User Intention Reconstruction for Large-Scale Visual Recommendation

Jia-Qi Yang, Chenglei Dai, Dan OU, Dongshuai Li, Ju Huang, De-Chuan Zhan, Xiaoyi Zeng, Yang Yang

With the advancement of multimedia internet, the impact of visual characteristics on the decision of users to click or not within the online retail industry is increasingly significant. Thus, incorporating visual features is a promising direction for further performance improvements in click-through rate (CTR). However, experiments on our production system revealed that simply injecting the image embeddings trained with established pre-training methods only has marginal improvements. We believe that the main advantage of existing image feature pre-training methods lies in their effectiveness for cross-modal predictions. However, this differs significantly from the task of CTR prediction in recommendation systems. In recommendation systems, other modalities of information (such as text) can be directly used as features in downstream models. Even if the performance of cross-modal prediction tasks is excellent, it is challenging to provide significant information gain for the downstream models. We argue that a visual feature pre-training method tailored for recommendation is necessary for further improvements beyond existing modality features. To this end, we propose an effective user intention reconstruction module to mine visual features related to user interests from behavior histories, which constructs a many-to-one correspondence. We further propose a contrastive training method to learn the user intentions and prevent the collapse of embedding vectors. We conduct extensive experimental evaluations on public datasets and our production system to verify that our method can learn users' visual interests. Our method achieves 0.46% improvement in offline AUC and 0.88% improvement in Taobao GMV (Gross Merchandise Volume) with p-value < 0.01.

6/7/2024

Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion

Linhan Xia, Yicheng Yang, Ziou Chen, Zheng Yang, Shengxin Zhu

Pre-trained models learn general representations from large datasets, which can be fine-tuned for specific tasks to significantly reduce training time. Pre-trained models like generative pretrained transformers (GPT), bidirectional encoder representations from transformers (BERT), and vision transformers (ViT) have become a cornerstone of current research in machine learning. This study proposes a multi-modal movie recommendation system by extracting features from the well-designed poster of each movie and the narrative text description of the movie. This system uses the BERT model to extract the information of the text modality, the ViT model to extract the information of the poster/image modality, and the Transformer architecture for feature fusion of all modalities to predict users' preferences. The integration of pre-trained foundational models with some smaller data sets in downstream applications captures multi-modal content features in a more comprehensive manner, thereby providing more accurate recommendations. The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets. The prediction accuracy of user ratings is enhanced in comparison to the baseline algorithm, thereby demonstrating the potential of this cross-modal algorithm to be applied for movie or video recommendation.

7/15/2024

Multi-intent Aware Contrastive Learning for Sequential Recommendation

Junshu Huang, Zi Long, Xianghua Fu, Yin Chen

Intent is a significant latent factor influencing user-item interaction sequences. Prevalent sequence recommendation models that utilize contrastive learning predominantly rely on single-intent representations to direct the training process. However, this paradigm oversimplifies real-world recommendation scenarios, attempting to encapsulate the diversity of intents within the single-intent level representation. SR models considering multi-intent information in their framework are more likely to reflect real-life recommendation scenarios accurately.

9/16/2024

Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey

Qijiong Liu, Jieming Zhu, Yanting Yang, Quanyu Dai, Zhaocheng Du, Xiao-Ming Wu, Zhou Zhao, Rui Zhang, Zhenhua Dong

Personalized recommendation serves as a ubiquitous channel for users to discover information tailored to their interests. However, traditional recommendation models primarily rely on unique IDs and categorical features for user-item matching, potentially overlooking the nuanced essence of raw item contents across multiple modalities such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, especially in multimedia services like news, music, and short-video platforms. The recent advancements in large multimodal models offer new opportunities and challenges in developing content-aware recommender systems. This survey seeks to provide a comprehensive exploration of the latest advancements and future trajectories in multimodal pretraining, adaptation, and generation techniques, as well as their applications in enhancing recommender systems. Furthermore, we discuss current open challenges and opportunities for future research in this dynamic domain. We believe that this survey, alongside the curated resources, will provide valuable insights to inspire further advancements in this evolving landscape.

7/4/2024