Multimodal Unsupervised Domain Generalization by Retrieving Across the Modality Gap

arXiv: 2402.04416

Published 5/30/2024 by Christopher Liao, Christian So, Theodoros Tsiligkaridis, Brian Kulis

Abstract

Domain generalization (DG) is an important problem that learns a model which generalizes to unseen test domains leveraging one or more source domains, under the assumption of shared label spaces. However, most DG methods assume access to abundant source data in the target label space, a requirement that proves overly stringent for numerous real-world applications, where acquiring the same label space as the target task is prohibitively expensive. For this setting, we tackle the multimodal version of the unsupervised domain generalization (MUDG) problem, which uses a large task-agnostic unlabeled source dataset during finetuning. Our framework does not explicitly assume any relationship between the source dataset and target task. Instead, it relies only on the premise that the source dataset can be accurately and efficiently searched in a joint vision-language space. We make three contributions in the MUDG setting. Firstly, we show theoretically that cross-modal approximate nearest neighbor search suffers from low recall due to the large distance between text queries and the image centroids used for coarse quantization. Accordingly, we propose paired k-means, a simple clustering algorithm that improves nearest neighbor recall by storing centroids in query space instead of image space. Secondly, we propose an adaptive text augmentation scheme for target labels designed to improve zero-shot accuracy and diversify retrieved image data. Lastly, we present two simple but effective components to further improve downstream target accuracy. We compare against state-of-the-art name-only transfer, source-free DG and zero-shot (ZS) methods on their respective benchmarks and show consistent improvement in accuracy on 20 diverse datasets. Code is available: https://github.com/Chris210634/mudg
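
The recall problem and the paired k-means idea described in the abstract can be illustrated with a small synthetic experiment. The following is a minimal sketch, not the authors' implementation: the function names, the constant offset standing in for the modality gap, and the exact update rule (images are assigned to the nearest centroid, while each centroid is recomputed as the mean of the queries paired with those images) are all assumptions made for illustration. The official code is in the repository linked above.

```python
import numpy as np

def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return (a * a).sum(1)[:, None] - 2.0 * a @ b.T + (b * b).sum(1)[None, :]

def kmeans(images, k, iters=10, seed=0):
    """Plain k-means: each centroid is the mean of the images in its cluster."""
    rng = np.random.default_rng(seed)
    centroids = images[rng.choice(len(images), k, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dists(images, centroids).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = images[assign == j].mean(0)
    return centroids, sq_dists(images, centroids).argmin(1)

def paired_kmeans(images, queries, k, iters=10, seed=0):
    """Assumed variant: images[i] is paired with queries[i]. Images are
    assigned to the nearest centroid, but each centroid is updated as the
    mean of the paired *queries*, so centroids live in query space."""
    rng = np.random.default_rng(seed)
    centroids = queries[rng.choice(len(queries), k, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dists(images, centroids).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = queries[assign == j].mean(0)
    return centroids, sq_dists(images, centroids).argmin(1)

def recall_at_1(test_queries, images, centroids, assign, nprobe=1):
    """Fraction of queries whose exact nearest image lies in a probed cluster."""
    true_nn = sq_dists(test_queries, images).argmin(1)
    probed = np.argsort(sq_dists(test_queries, centroids), axis=1)[:, :nprobe]
    return float((probed == assign[true_nn][:, None]).any(1).mean())

# Synthetic stand-ins for image and text embeddings (hypothetical data).
rng = np.random.default_rng(0)
dim, n, k = 16, 2000, 20
images = rng.normal(size=(n, dim))
gap = np.zeros(dim)
gap[0] = 5.0  # assumed constant offset playing the role of the modality gap
train_queries = images + 0.3 * rng.normal(size=(n, dim)) + gap
test_queries = images[:200] + 0.3 * rng.normal(size=(200, dim)) + gap

c_img, a_img = kmeans(images, k)
c_qry, a_qry = paired_kmeans(images, train_queries, k)
print("image-space centroids:", recall_at_1(test_queries, images, c_img, a_img))
print("query-space centroids:", recall_at_1(test_queries, images, c_qry, a_qry))
```

The printed values depend entirely on the synthetic data and say nothing about the paper's reported results. The sketch only shows where the two clustering variants differ: in the space in which the coarse-quantization centroids are stored.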

Overview

The paper proposes a data-centric approach for unsupervised domain generalization using large-scale multimodal web data.
It aims to improve the ability of machine learning models to perform well on new, unseen domains without requiring labeled data from those domains.
The approach involves retrieving relevant multimodal data from the web and using it to train models that can generalize to diverse real-world scenarios.

Plain English Explanation

One of the key challenges in machine learning is getting models to perform well on new data that differs from the data they were trained on. This is known as the domain generalization problem. Towards Generalizing to Unseen Domains with Few Labels and Less but Better: Enabling Generalized Zero-shot Learning Towards Unseen Domains by Intrinsic Learning from Redundant LLM Semantics are two examples of research tackling this challenge.

In this paper, the authors propose a new approach that leverages the vast amounts of multimodal (text, images, etc.) data available on the internet. The idea is to retrieve relevant web data and use it to train models that can then generalize well to new, unseen domains. This is especially useful when you don't have labeled data for the target domains.

The key innovation is the "data-centric" aspect: instead of focusing only on model architecture or optimization techniques, the authors emphasize the training data itself. By carefully selecting and curating web-scale multimodal data, they aim to create models that are more robust and adaptable to diverse real-world scenarios.

Technical Explanation

The paper addresses Multimodal Unsupervised Domain Generalization (MUDG), a setting in which a large, task-agnostic, unlabeled source dataset is used during fine-tuning. It relates to prior work such as MultiMatch: Multi-task Learning for Semi-supervised Domain Generalization and Language Guided Domain Generalized Medical Image Segmentation, which explored domain generalization for specific tasks like image classification and segmentation.

The key technical contribution is a Retrieval-based Multimodal Unsupervised Domain Generalization (RMUDG) framework. This involves:

Multimodal Data Retrieval: Searching a large, task-agnostic, unlabeled source dataset in a joint vision-language space to retrieve images relevant to the target labels.
Multimodal Contrastive Learning: Using contrastive learning techniques to train models that can extract meaningful representations from the retrieved multimodal data.
Domain-Agnostic Model Fine-Tuning: Fine-tuning the model on the retrieved, unlabeled source data to further improve its accuracy on the target task. The MUDG setting does not assume labeled data from the target domain(s).
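
The retrieval step in the list above can be sketched as a nearest-neighbor search from label text embeddings into image embeddings. This is a minimal illustration under stated assumptions, not the paper's pipeline: the function name `retrieve_per_label` is hypothetical, the embeddings are random placeholders (in practice they would come from a CLIP-style joint vision-language encoder, which is itself an assumption here), and using several prompts per label is only a stand-in for the adaptive text augmentation mentioned in the abstract.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve_per_label(text_emb, image_emb, top_k):
    """Retrieve the top_k most similar images for every prompt of every label.

    text_emb:  (num_labels, num_prompts, dim) text embeddings of label prompts
    image_emb: (num_images, dim) embeddings of the unlabeled source images
    Returns a dict {image_index: label}. An image retrieved by several prompts
    keeps the label of the prompt it is most similar to.
    """
    text_emb, image_emb = l2_normalize(text_emb), l2_normalize(image_emb)
    pseudo_labels, best_sim = {}, {}
    for label in range(text_emb.shape[0]):
        sims = text_emb[label] @ image_emb.T        # (num_prompts, num_images)
        top = np.argsort(-sims, axis=1)[:, :top_k]  # best images per prompt
        for prompt, row in enumerate(top):
            for idx in row.tolist():
                sim = float(sims[prompt, idx])
                if sim > best_sim.get(idx, -np.inf):
                    best_sim[idx] = sim
                    pseudo_labels[idx] = label
    return pseudo_labels

# Placeholder embeddings: 3 labels, 4 prompts each, 500 source images.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 4, 32))
image_emb = rng.normal(size=(500, 32))
subset = retrieve_per_label(text_emb, image_emb, top_k=10)
print(len(subset), "images retrieved for fine-tuning")
```

The returned dictionary is a pseudo-labeled subset of the source data that a fine-tuning stage could consume. A brute-force search is used here for clarity. At the scale the paper targets, an approximate nearest-neighbor index would replace the full similarity matrix.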

The authors compare their approach against state-of-the-art name-only transfer, source-free DG, and zero-shot (ZS) methods on their respective benchmarks and report consistent accuracy improvements across 20 diverse datasets.

Critical Analysis

The paper makes a compelling case for the value of a data-centric approach to domain generalization. By carefully curating large-scale multimodal web data, the authors show that models can be trained to perform well on a wide range of unseen domains.

However, a potential limitation is the reliance on web data, which may introduce biases or noise that could negatively impact model performance. The authors acknowledge this and suggest further research is needed to better understand and mitigate these issues.

Additionally, the paper focuses on relatively simple tasks like image classification. It would be interesting to see how the RMUDG framework would scale to more complex, real-world applications, such as Grounding Stylistic Domain Generalization: A Quantitative Domain Shift Analysis.

Overall, this research represents an important step forward in the field of domain generalization and highlights the potential of leveraging large-scale web data to build more robust and adaptable machine learning models.

Conclusion

The key contribution of this paper is a data-centric approach to unsupervised domain generalization that leverages large-scale multimodal web data. By carefully retrieving and curating relevant examples, the authors demonstrate that models can be trained to generalize well to new, unseen domains without requiring labeled data from those domains.

This research has important implications for the development of more robust and adaptable machine learning systems, which can be critical for real-world applications that need to perform well across diverse scenarios. While the paper focuses on relatively simple tasks, the underlying principles could potentially be extended to more complex domains, further expanding the reach and impact of this work.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

MultiMatch: Multi-task Learning for Semi-supervised Domain Generalization

Lei Qi, Hongpeng Yang, Yinghuan Shi, Xin Geng

Domain generalization (DG) aims at learning a model on source domains to well generalize on the unseen target domain. Although it has achieved great success, most of existing methods require the label information for all training samples in source domains, which is time-consuming and expensive in the real-world application. In this paper, we resort to solving the semi-supervised domain generalization (SSDG) task, where there are a few label information in each source domain. To address the task, we first analyze the theory of the multi-domain learning, which highlights that 1) mitigating the impact of domain gap and 2) exploiting all samples to train the model can effectively reduce the generalization error in each source domain so as to improve the quality of pseudo-labels. According to the analysis, we propose MultiMatch, i.e., extending FixMatch to the multi-task learning framework, producing the high-quality pseudo-label for SSDG. To be specific, we consider each training domain as a single task (i.e., local task) and combine all training domains together (i.e., global task) to train an extra task for the unseen test domain. In the multi-task framework, we utilize the independent BN and classifier for each task, which can effectively alleviate the interference from different domains during pseudo-labeling. Also, most of parameters in the framework are shared, which can be trained by all training samples sufficiently. Moreover, to further boost the pseudo-label accuracy and the model's generalization, we fuse the predictions from the global task and local task during training and testing, respectively. A series of experiments validate the effectiveness of the proposed method, and it outperforms the existing semi-supervised methods and the SSDG method on several benchmark DG datasets.

4/30/2024

cs.CV

Towards Generalizing to Unseen Domains with Few Labels

Chamuditha Jayanga Galappaththige, Sanoojan Baliah, Malitha Gunawardhana, Muhammad Haris Khan

We approach the challenge of addressing semi-supervised domain generalization (SSDG). Specifically, our aim is to obtain a model that learns domain-generalizable features by leveraging a limited subset of labelled data alongside a substantially larger pool of unlabeled data. Existing domain generalization (DG) methods which are unable to exploit unlabeled data perform poorly compared to semi-supervised learning (SSL) methods under SSDG setting. Nevertheless, SSL methods have considerable room for performance improvement when compared to fully-supervised DG training. To tackle this underexplored, yet highly practical problem of SSDG, we make the following core contributions. First, we propose a feature-based conformity technique that matches the posterior distributions from the feature space with the pseudo-label from the model's output space. Second, we develop a semantics alignment loss to learn semantically-compatible representations by regularizing the semantic structure in the feature space. Our method is plug-and-play and can be readily integrated with different SSL-based SSDG baselines without introducing any additional parameters. Extensive experimental results across five challenging DG benchmarks with four strong SSL baselines suggest that our method provides consistent and notable gains in two different SSDG settings.

5/8/2024

cs.CV

Less but Better: Enabling Generalized Zero-shot Learning Towards Unseen Domains by Intrinsic Learning from Redundant LLM Semantics

Jiaqi Yue, Jiancheng Zhao, Chunhui Zhao

Generalized zero-shot learning (GZSL) focuses on recognizing seen and unseen classes against domain shift problem (DSP) where data of unseen classes may be misclassified as seen classes. However, existing GZSL is still limited to seen domains. In the current work, we pioneer cross-domain GZSL (CDGZSL) which addresses GZSL towards unseen domains. Different from existing GZSL methods which alleviate DSP by generating features of unseen classes with semantics, CDGZSL needs to construct a common feature space across domains and acquire the corresponding intrinsic semantics shared among domains to transfer from seen to unseen domains. Considering the information asymmetry problem caused by redundant class semantics annotated with large language models (LLMs), we present Meta Domain Alignment Semantic Refinement (MDASR). Technically, MDASR consists of two parts: Inter-class Similarity Alignment (ISA), which eliminates the non-intrinsic semantics not shared across all domains under the guidance of inter-class feature relationships, and Unseen-class Meta Generation (UMG), which preserves intrinsic semantics to maintain connectivity between seen and unseen classes by simulating feature generation. MDASR effectively aligns the redundant semantic space with the common feature space, mitigating the information asymmetry in CDGZSL. The effectiveness of MDASR is demonstrated on the Office-Home and Mini-DomainNet, and we have shared the LLM-based semantics for these datasets as the benchmark.

5/24/2024

cs.CV

Language Guided Domain Generalized Medical Image Segmentation

Shahina Kunhimon, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan

Single source domain generalization (SDG) holds promise for more reliable and consistent image segmentation across real-world clinical settings particularly in the medical domain, where data privacy and acquisition cost constraints often limit the availability of diverse datasets. Depending solely on visual features hampers the model's capacity to adapt effectively to various domains, primarily because of the presence of spurious correlations and domain-specific characteristics embedded within the image features. Incorporating text features alongside visual features is a potential solution to enhance the model's understanding of the data, as it goes beyond pixel-level information to provide valuable context. Textual cues describing the anatomical structures, their appearances, and variations across various imaging modalities can guide the model in domain adaptation, ultimately contributing to more robust and consistent segmentation. In this paper, we propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features to learn a more robust feature representation. We assess the effectiveness of our text-guided contrastive feature alignment technique in various scenarios, including cross-modality, cross-sequence, and cross-site settings for different segmentation tasks. Our approach achieves favorable performance against existing methods in literature. Our code and model weights are available at https://github.com/ShahinaKK/LG_SDG.git.

4/4/2024

cs.CV