RWKV-CLIP: A Robust Vision-Language Representation Learner






Published 6/12/2024 by Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng


Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner. It achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at



Overview

  • The paper presents RWKV-CLIP, a vision-language representation learning model built on the RWKV architecture and positioned as a more robust and efficient alternative to existing CLIP models.
  • RWKV-CLIP applies CLIP-style image-text pre-training to RWKV, an architecture that combines the parallel training of transformers with the efficient inference of RNNs.
  • To reduce noise in web-crawled image-text pairs, the authors use LLMs to synthesize and refine descriptions from web-based texts, synthetic captions, and detection tags.
  • Across several model scales and pre-training datasets, the model reports state-of-the-art results on linear probe, zero-shot classification, and zero-shot image-text retrieval.

Plain English Explanation

RWKV-CLIP is a new type of AI model that can understand both images and text. It applies the training recipe of CLIP, a model that learns to match images with their descriptions, to RWKV, an architecture that trains as efficiently as a transformer but runs as cheaply as a recurrent network.

The researchers behind RWKV-CLIP trained the model on large collections of images paired with text from the web. Because web captions are often noisy, they also used large language models to rewrite and enrich the descriptions, drawing on the original web text, machine-generated captions, and detected object tags. Cleaner descriptions help the model connect visual information with language.

As a result, RWKV-CLIP does well on several tasks: finding images that match a given text description (and the reverse), classifying images into categories it was never explicitly trained on, and supplying image features on which a simple classifier can be trained (linear probe).

The paper's two contributions are cleaner training data and a more efficient architecture. The authors describe the resulting model as "robust" because it performs well across several model scales and pre-training datasets. This could make RWKV-CLIP a useful tool for real-world applications that involve understanding both images and text.

Technical Explanation

RWKV-CLIP applies CLIP-style pre-training to the RWKV architecture. RWKV (Receptance Weighted Key Value) replaces quadratic self-attention with a linear-time recurrence, which lets it train in parallel like a transformer and run inference like an RNN. CLIP is a contrastive vision-language model that has shown impressive zero-shot capabilities.
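The idea of a model that can be evaluated either over a whole sequence or one step at a time can be illustrated with the WKV operator from the original RWKV language model. The sketch below is a simplified, single-channel version. The decay `w`, the current-token bonus `u`, and the function names are illustrative assumptions, and the blocks actually used in RWKV-CLIP may differ. It computes the same outputs two ways: a direct sum over all earlier tokens, and a recurrence that carries only two running numbers.

```python
import math


def wkv_direct(keys, values, w, u):
    """Direct form: for each step t, sum over every earlier token.

    Earlier tokens are weighted by exp(-(t - 1 - i) * w + k_i); the current
    token gets a separate bonus weight exp(u + k_t).
    """
    out = []
    for t in range(len(keys)):
        bonus = math.exp(u + keys[t])
        num = bonus * values[t]
        den = bonus
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + keys[i])
            num += weight * values[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(keys, values, w, u):
    """Recurrent form: the same outputs from a constant-size state (a, b)."""
    a = 0.0  # decayed running sum of exp(k_i) * v_i over past tokens
    b = 0.0  # decayed running sum of exp(k_i) over past tokens
    decay = math.exp(-w)
    out = []
    for k, v in zip(keys, values):
        bonus = math.exp(u + k)
        out.append((a + bonus * v) / (b + bonus))
        # Fold the current token into the state for the next step.
        a = decay * a + math.exp(k) * v
        b = decay * b + math.exp(k)
    return out
```

The direct form does work proportional to the square of the sequence length, while the recurrent form does a fixed amount of work per token. This sketch omits the numerical-stability tricks a production implementation would need for large keys.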

The researchers trained RWKV-CLIP on large-scale image-text data crawled from the web. To reduce noise, they introduce a diverse description generation framework that uses LLMs to synthesize and refine content from web-based texts, synthetic captions, and detection tags.

On the architecture side, the abstract describes RWKV-CLIP as the first RWKV-driven vision-language representation learning model. As in CLIP, an image encoder and a text encoder map their inputs into a shared embedding space, and contrastive training aligns matching image-text pairs. The learned representations are evaluated on linear probe, zero-shot classification, and zero-shot image-text retrieval.
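CLIP-style models align images and text with a symmetric contrastive objective: within a batch, each image should score highest against its own caption, and each caption against its own image. The following is a minimal sketch of that loss in plain Python, assuming embeddings are given as lists of floats and a fixed temperature of 0.07 (CLIP learns this value during training). The function names are illustrative and not taken from the paper's code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of images matches pair i of texts."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each row should peak on the diagonal.
    i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on the diagonal.
    t2i = sum(
        cross_entropy([logits[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (i2t + t2i)
```

With correctly paired embeddings the loss approaches zero, and shuffling the captions raises it sharply, which is the signal that drives the two encoders toward a shared space.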

The experiments in the paper span several model scales and pre-training datasets. According to the authors, RWKV-CLIP achieves state-of-the-art performance on several downstream tasks while remaining efficient, which suggests that RWKV is a viable backbone for CLIP-style models.
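Zero-shot classification and image-text retrieval both reduce to nearest-neighbor search in the shared embedding space. The sketch below shows the general evaluation idea under simple assumptions: embeddings are plain lists of floats, class names have already been encoded as text prompts, and query `i` is assumed to match gallery item `i`. The function names and toy vectors are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the label whose text embedding is closest to the image.

    class_text_embs maps each label to the embedding of a text prompt
    for that class, so no classifier is trained on the target labels.
    """
    return max(
        class_text_embs,
        key=lambda label: cosine(image_emb, class_text_embs[label]),
    )


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        ranked = sorted(
            range(len(gallery_embs)),
            key=lambda j: cosine(query, gallery_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(query_embs)


# Toy usage with hand-written 2-D embeddings.
class_embs = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
label = zero_shot_classify([0.9, 0.2], class_embs)  # "cat"
```

Retrieval benchmarks typically report recall at several values of k in both directions (image-to-text and text-to-image) by swapping which side serves as the query.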

Critical Analysis

The paper does a thorough job of evaluating RWKV-CLIP and comparing it to state-of-the-art vision-language models. The results show clear improvements in performance, which is impressive. However, the paper does not delve deeply into the potential limitations or caveats of the approach.

For example, the training data is still drawn from large-scale web crawls, and there are concerns about the biases and pitfalls that can arise from such data. The paper also does not address potential issues around the model's interpretability or the transparency of its decision-making process.

Additionally, while the paper highlights the versatility of RWKV-CLIP, it would be valuable to understand the model's limitations and the types of tasks or scenarios where it may not perform as well. Further research is needed to fully assess the robustness and generalizability of the approach.

Overall, the RWKV-CLIP model represents an interesting and promising step forward in multimodal AI. However, as with any new technology, it will be important to continue studying its capabilities, limitations, and potential societal impacts in a thoughtful and critical manner.


Conclusion

The RWKV-CLIP model presented in this paper is a significant advancement in the field of vision-language representation learning. By combining the strengths of the RWKV language model and the CLIP vision-language model, the researchers have created a powerful and versatile multimodal AI system.

The strong performance of RWKV-CLIP across a variety of benchmark tasks, including linear probe, zero-shot image classification, and zero-shot image-text retrieval, suggests that this approach has the potential to unlock new capabilities in real-world applications that require understanding both visual and linguistic information.

While the paper leaves open some important questions about the model's limitations and potential biases, the overall contribution is a step forward in developing AI systems that can integrate and reason about multiple modalities of information. As the field of multimodal AI continues to evolve, RWKV-CLIP may serve as an influential reference point for future research and development.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RankCLIP: Ranking-Consistent Language-Image Pretraining


Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun





Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.




CLIP-KD: An Empirical Study of CLIP Model Distillation

Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu





Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5% and 20.1% margins, respectively. Our code is released on



Modeling Caption Diversity in Contrastive Vision-Language Pretraining


Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mahmoud Assran, Andrew Gordon Wilson, Aaron Courville, Nicolas Ballas





There are a thousand ways to caption an image. Contrastive Language-Image Pre-training (CLIP), on the other hand, works by mapping an image and its caption to a single vector, limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% across zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.




Demystifying CLIP Data

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer





Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at

