MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis

Read original: arXiv:2311.08236 - Published 7/23/2024 by Yitao Zhu, Zhenrong Shen, Zihao Zhao, Sheng Wang, Xin Wang, Xiangyu Zhao, Dinggang Shen, Qian Wang

Overview

The common practice for building computer-aided diagnosis (CAD) models is to fine-tune pre-trained Vision Transformer (ViT) models, which becomes resource-intensive as ViTs grow larger.
Real-world deployment of multiple CAD models can be problematic due to limited storage space and time-consuming model switching.
The paper proposes a new method called MeLo (Medical image Low-rank adaptation) to address these challenges.

Plain English Explanation

The researchers recognized that the typical way of developing CAD models using transformer architectures can be costly and impractical. ViT models have become much larger, making them less accessible to medical imaging communities. Additionally, in real-world settings, having multiple CAD models can create problems, such as limited storage space and slow model switching.

To solve these issues, the researchers developed a new method called MeLo. Instead of the resource-intensive process of fine-tuning, MeLo uses a low-rank adaptation approach. This means they fix the weights of the ViT model and only add small, low-rank "plug-ins" to adapt the model to different medical imaging tasks.

With this lightweight approach, the researchers matched the performance of fully fine-tuned ViT models on four different medical imaging datasets while training only about 0.17% of the parameters. MeLo also adds only about 0.5MB of storage and allows for extremely fast model switching during deployment and inference.
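The model-switching idea can be pictured with a minimal sketch: one frozen weight matrix is shared, and each task only contributes a small pair of low-rank matrices that is selected at inference time. The class name `AdapterBank`, the task name `chest_xray`, and the single-layer setup are hypothetical illustrations, not the authors' code.

```python
import numpy as np


class AdapterBank:
    """One shared frozen weight plus a dictionary of per-task low-rank plug-ins."""

    def __init__(self, weight):
        self.weight = weight   # shared backbone weight, never modified
        self.adapters = {}     # task name -> (A, B) low-rank pair
        self.active = None     # currently selected task, if any

    def register(self, task, A, B):
        # Storing a task costs only the two small matrices A and B.
        self.adapters[task] = (A, B)

    def switch(self, task):
        # Switching is a dictionary lookup; the backbone is not reloaded.
        if task not in self.adapters:
            raise KeyError(f"no adapter registered for task {task!r}")
        self.active = task

    def forward(self, x):
        out = x @ self.weight.T
        if self.active is not None:
            A, B = self.adapters[self.active]
            out = out + (x @ A.T) @ B.T  # low-rank correction for this task
        return out


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))
bank = AdapterBank(W)
bank.register("chest_xray", rng.normal(size=(2, 4)), rng.normal(size=(5, 2)))
bank.switch("chest_xray")
print(bank.forward(np.ones((1, 4))).shape)
```

Because the large backbone stays in memory and only the small plug-in changes, switching tasks in this sketch never touches the shared weights.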

Technical Explanation

The researchers propose a new method called MeLo (Medical image Low-rank adaptation) to address the challenges of developing and deploying computer-aided diagnosis (CAD) models based on transformer architectures.

Instead of the common practice of fine-tuning from ImageNet pre-trained weights, MeLo adopts a low-rank adaptation approach. This involves fixing the weights of the Vision Transformer (ViT) model and only adding small, low-rank "plug-ins" to adapt the model to different medical imaging tasks.
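As a rough illustration of the "frozen weights plus low-rank plug-in" idea, the sketch below implements a single linear layer in the style of standard LoRA: the output is the frozen projection plus a scaled product of two small matrices. The rank, the `alpha` scaling, and the zero initialization of `B` follow the common LoRA convention and are assumptions here; this summary does not state which layers MeLo adapts or which hyperparameters it uses.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (LoRA-style sketch)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight          # frozen pre-trained weight, never updated
        self.scale = alpha / rank     # assumed LoRA-style scaling factor
        # A starts small and random, B starts at zero, so the plug-in
        # leaves the pre-trained output unchanged before training.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))
        self.B = np.zeros((out_dim, rank))

    def forward(self, x):
        # x: (batch, in_dim) -> (batch, out_dim)
        frozen = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return frozen + self.scale * update

    def trainable_params(self):
        # Only A and B would receive gradients in training.
        return self.A.size + self.B.size


W = np.random.default_rng(1).normal(size=(768, 768))
layer = LoRALinear(W, rank=4)
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

For a 768 x 768 projection with rank 4, the plug-in holds 6,144 values against 589,824 frozen ones, which is why the adapted model can stay so small.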

The researchers evaluate MeLo on four distinct medical imaging datasets and show that it achieves performance comparable to fully fine-tuned ViT models while training only about 0.17% of the parameters. This lightweight approach also adds minimal storage overhead (about 0.5MB) and enables extremely fast model switching during deployment and inference.
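A back-of-the-envelope calculation shows how figures of this size can arise. The configuration below (a ViT-Base-sized backbone of roughly 86 million parameters, hidden size 768, 12 layers, rank 4, two adapted projections per layer, float32 storage) is an assumption chosen for illustration; the summary itself does not specify the backbone or the adapter settings.

```python
def lora_param_count(hidden_dim, rank, num_layers, adapted_per_layer):
    """Parameters added by low-rank plug-ins on square hidden_dim x hidden_dim projections."""
    # Each adapted projection gets A (rank x hidden_dim) and B (hidden_dim x rank).
    per_projection = 2 * hidden_dim * rank
    return num_layers * adapted_per_layer * per_projection


backbone_params = 86_000_000  # approximate ViT-Base size (assumption)
adapter_params = lora_param_count(
    hidden_dim=768, rank=4, num_layers=12, adapted_per_layer=2
)
fraction = adapter_params / backbone_params
size_mb = adapter_params * 4 / 1e6  # 4 bytes per float32 value

print(adapter_params)                   # 147456
print(f"{fraction:.2%} of backbone")    # 0.17% of backbone
print(f"{size_mb:.2f} MB in float32")   # 0.59 MB in float32
```

Under these assumed settings the plug-in comes to roughly 0.17% of the backbone and a little over half a megabyte, in line with the numbers reported above.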

Critical Analysis

MeLo provides a more efficient and practical way to develop and deploy CAD models based on transformer architectures, but the approach may have some limitations.

One potential concern is that by fixing the weights of the ViT model and only adding small, low-rank plug-ins, the model may not be able to fully capture the nuances and complexity of different medical imaging tasks. This could result in a slight performance trade-off compared to fully fine-tuned models.

Additionally, the researchers do not provide a comprehensive analysis of the types of medical imaging tasks and datasets where MeLo would be most effective. It would be helpful to understand the specific scenarios or use cases where this approach shines, as well as any limitations or potential issues that may arise in certain domains.

Overall, the MeLo method represents an interesting and practical solution for deploying CAD models in real-world settings, but further research and testing may be needed to fully understand its strengths, weaknesses, and the breadth of its applicability.

Conclusion

The paper presents a novel method called MeLo (Medical image Low-rank adaptation) that addresses the challenges of developing and deploying computer-aided diagnosis (CAD) models based on transformer architectures.

By using lightweight low-rank adaptation instead of resource-demanding fine-tuning, MeLo achieves performance comparable to fully fine-tuned Vision Transformer (ViT) models while training only a small fraction of the parameters. This makes MeLo a practical and efficient solution for real-world deployments, where storage space and model switching speed are important considerations.

The researchers have made their source code and pre-trained weights publicly available, which will likely be of great interest to the medical imaging community as they explore more efficient ways to develop and deploy CAD models for various clinical tasks across different imaging modalities.

Related Papers

MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis

Yitao Zhu, Zhenrong Shen, Zihao Zhao, Sheng Wang, Xin Wang, Xiangyu Zhao, Dinggang Shen, Qian Wang

The common practice in developing computer-aided diagnosis (CAD) models based on transformer architectures usually involves fine-tuning from ImageNet pre-trained weights. However, with recent advances in large-scale pre-training and the practice of scaling laws, Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities. Additionally, in real-world scenarios, the deployments of multiple CAD models can be troublesome due to problems such as limited storage space and time-consuming model switching. To address these challenges, we propose a new method MeLo (Medical image Low-rank adaptation), which enables the development of a single CAD model for multiple clinical tasks in a lightweight manner. It adopts low-rank adaptation instead of resource-demanding fine-tuning. By fixing the weight of ViT models and only adding small low-rank plug-ins, we achieve competitive results on various diagnosis tasks across different imaging modalities using only a few trainable parameters. Specifically, our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets using about 0.17% trainable parameters. Moreover, MeLo adds only about 0.5MB of storage space and allows for extremely fast model switching in deployment and inference. Our source code and pre-trained weights are available on our website (https://absterzhu.github.io/melo.github.io/).

7/23/2024

A Large-scale Medical Visual Task Adaptation Benchmark

Shentong Mo, Xufang Luo, Yansen Wang, Dongsheng Li

Visual task adaptation has been demonstrated to be effective in adapting pre-trained Vision Transformers (ViTs) to general downstream visual tasks using specialized learnable layers or tokens. However, there is not yet a large-scale benchmark to fully explore the effect of visual task adaptation on the realistic and important medical domain, particularly across diverse medical visual modalities, such as color images, X-ray, and CT. To close this gap, we present Med-VTAB, a large-scale Medical Visual Task Adaptation Benchmark consisting of 1.68 million medical images for diverse organs, modalities, and adaptation approaches. Based on Med-VTAB, we explore the scaling law of medical prompt tuning concerning tunable parameters and the generalizability of medical visual adaptation using non-medical/medical pre-train weights. Besides, we study the impact of patient ID out-of-distribution on medical visual adaptation, which is a real and challenging scenario. Furthermore, results from Med-VTAB indicate that a single pre-trained model falls short in medical task adaptation. Therefore, we introduce GMoE-Adapter, a novel method that combines medical and general pre-training weights through a gated mixture-of-experts adapter, achieving state-of-the-art results in medical visual task adaptation.

4/22/2024

MoVL: Exploring Fusion Strategies for the Domain-Adaptive Application of Pretrained Models in Medical Imaging Tasks

Haijiang Tian, Jingkun Yue, Xiaohong Liu, Guoxing Yang, Zeyu Jiang, Guangyu Wang

Medical images are often more difficult to acquire than natural images because of the specialized equipment and technology involved, which leads to fewer medical image datasets. It is therefore hard to train a strong pretrained medical vision model. How best to adapt a vision model pretrained on natural images to the medical domain remains an open question. For image classification, a popular method is linear probe (LP). However, LP only considers the output after feature extraction, and a gap remains between input medical images and a vision model pretrained on natural images. We introduce visual prompting (VP) to fill this gap, and analyze strategies for coupling LP and VP. We design a joint learning loss function containing a categorisation loss and a discrepancy loss, which describes the variance between prompted and plain images, and name this joint training strategy MoVL (Mixture of Visual Prompting and Linear Probe). We experiment on 4 medical image classification datasets, with two mainstream architectures, ResNet and CLIP. Results show that, without changing the parameters or architecture of the backbone model and with fewer parameters, MoVL has the potential to reach full finetune (FF) accuracy (on four medical datasets, an average of 90.91% for MoVL and 91.13% for FF). On an out-of-distribution medical dataset, our method (90.33%) can outperform FF (85.15%) with an absolute lead of 5.18%.

5/14/2024

Mixture of Low-rank Experts for Transferable AI-Generated Image Detection

Zihan Liu, Hanyi Wang, Yaoyu Kang, Shilin Wang

Generative models have shown a giant leap in synthesizing photo-realistic images with minimal expertise, sparking concerns about the authenticity of online information. This study aims to develop a universal AI-generated image detector capable of identifying images from diverse sources. Existing methods struggle to generalize across unseen generative models when provided with limited sample sources. Inspired by the zero-shot transferability of pre-trained vision-language models, we seek to harness the nontrivial visual-world knowledge and descriptive proficiency of CLIP-ViT to generalize over unknown domains. This paper presents a novel parameter-efficient fine-tuning approach, mixture of low-rank experts, to fully exploit CLIP-ViT's potential while preserving knowledge and expanding capacity for transferable detection. We adapt only the MLP layers of deeper ViT blocks via an integration of shared and separate LoRAs within an MoE-based structure. Extensive experiments on public benchmarks show that our method achieves superiority over state-of-the-art approaches in cross-generator generalization and robustness to perturbations. Remarkably, our best-performing ViT-L/14 variant requires training only 0.08% of its parameters to surpass the leading baseline by +3.64% mAP and +12.72% avg.Acc across unseen diffusion and autoregressive models. This even outperforms the baseline with just 0.28% of the training data. Our code and pre-trained models will be available at https://github.com/zhliuworks/CLIPMoLE.

4/9/2024