Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend

Read original: arXiv:2311.13717 - Published 5/30/2024 by McKell Woodland, Austin Castelo, Mais Al Taie, Jessica Albuquerque Marques Silva, Mohamed Eltaher, Frank Mohn, Alexander Shieh, Austin Castelo, Suprateek Kundu, Joshua P. Yung and 2 others

✨

Overview

The paper examines the use of the Fréchet Inception Distance (FID) metric for assessing the quality of synthetic medical images generated by Generative Adversarial Networks (GANs).
FID is a widely used metric for evaluating the quality of synthetic images, but it relies on an ImageNet-based feature extractor, which raises questions about its applicability to medical imaging.
The study compares the performance of ImageNet-based and medical image-based (RadImageNet) feature extractors in computing FID for medical imaging tasks.

Plain English Explanation

The Fréchet Inception Distance (FID) is a popular way to evaluate the quality of synthetic images generated by AI models. It works by comparing the features of the synthetic images to the features of real images. The more similar the features, the higher the quality of the synthetic images.

However, the FID metric relies on a feature extractor trained on the ImageNet dataset, which is a general collection of everyday images. This raises the question of whether the FID metric is suitable for evaluating synthetic medical images, which can have very different visual characteristics compared to everyday images.

To address this, the researchers in this study tried using feature extractors trained on medical images (called RadImageNet) instead of the standard ImageNet-based extractor. They evaluated 16 different GAN models across four medical imaging modalities and four data augmentation techniques, using both ImageNet-based and RadImageNet-based FID scores.

The key finding is that the ImageNet-based FID scores were more consistent with human expert judgments of the synthetic medical image quality, compared to the RadImageNet-based scores. This was surprising, as the researchers had expected the medical image-trained extractor to perform better for this task.

Technical Explanation

The researchers evaluated 16 different StyleGAN2 networks trained on four medical imaging modalities (CT, MRI, X-Ray, and Histology) and four data augmentation techniques. They computed FID scores using eleven different feature extractors - some trained on ImageNet, and some trained on a medical image dataset called RadImageNet.

To assess the validity of the FID scores, the researchers conducted visual Turing tests, where human experts were asked to judge the quality of the synthetic medical images. They found that the FID scores derived from the ImageNet-trained SwAV feature extractor were significantly correlated with the expert evaluations, while the rankings based on RadImageNet-trained extractors were much more volatile and inconsistent with human judgment.

This suggests that, contrary to prevailing assumptions, medical image-trained feature extractors do not inherently improve the reliability of FID for medical imaging tasks. In fact, the researchers found that the ImageNet-based extractors produced FID scores that were more aligned with human perception of synthetic medical image quality.

Critical Analysis

The researchers acknowledge several limitations of their study. First, they only evaluated a single GAN architecture (StyleGAN2) and a limited set of medical imaging modalities and data augmentation techniques. It's possible that the findings may not generalize to other GAN models or a wider range of medical imaging applications.

Additionally, the researchers only used visual Turing tests to assess human judgment, which may not capture all aspects of image quality that are important for medical use cases. Further validation with domain experts performing clinically-relevant tasks could provide additional insights.

Another potential issue is the choice of the RadImageNet dataset for training the medical image-based feature extractors. While this dataset is larger and more diverse than many medical image datasets, it may still not be representative of the full range of medical imaging data and use cases.

Despite these limitations, the study provides valuable evidence challenging the common assumption that medical image-trained feature extractors are inherently superior for evaluating synthetic medical images using the FID metric. The researchers encourage further research to rethink perceptual metrics for medical image translation and to detect potential domain shifts when applying FID and other evaluation metrics across different medical imaging domains.

Conclusion

This study casts doubt on the prevailing assumption that adapting the FID metric to medical imaging by using feature extractors trained on medical image datasets will inherently improve its reliability and alignment with human judgment. The researchers found that ImageNet-based feature extractors actually produced FID scores that were more consistent with expert evaluations of synthetic medical image quality compared to their medical image-trained counterparts.

These findings suggest that the choice of feature extractor for computing FID is a critical factor, and that medical image-specific feature extractors do not necessarily outperform general-purpose extractors for this task. The study encourages further research to better understand the limitations of FID and other perceptual metrics when applied to medical imaging, and to explore alternative evaluation approaches that may be more suitable for this domain.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

✨

Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend

McKell Woodland, Austin Castelo, Mais Al Taie, Jessica Albuquerque Marques Silva, Mohamed Eltaher, Frank Mohn, Alexander Shieh, Austin Castelo, Suprateek Kundu, Joshua P. Yung, Ankit B. Patel, Kristy K. Brock

Fr'echet Inception Distance (FID) is a widely used metric for assessing synthetic image quality. It relies on an ImageNet-based feature extractor, making its applicability to medical imaging unclear. A recent trend is to adapt FID to medical imaging through feature extractors trained on medical images. Our study challenges this practice by demonstrating that ImageNet-based extractors are more consistent and aligned with human judgment than their RadImageNet counterparts. We evaluated sixteen StyleGAN2 networks across four medical imaging modalities and four data augmentation techniques with Fr'echet distances (FDs) computed using eleven ImageNet or RadImageNet-trained feature extractors. Comparison with human judgment via visual Turing tests revealed that ImageNet-based extractors produced rankings consistent with human judgment, with the FD derived from the ImageNet-trained SwAV extractor significantly correlating with expert evaluations. In contrast, RadImageNet-based rankings were volatile and inconsistent with human judgment. Our findings challenge prevailing assumptions, providing novel evidence that medical image-trained feature extractors do not inherently improve FDs and can even compromise their reliability. Our code is available at https://github.com/mckellwoodland/fid-med-eval.

5/30/2024

Analyzing the Feature Extractor Networks for Face Image Synthesis

Erdi Sar{i}tac{s}, Haz{i}m Kemal Ekenel

Advancements like Generative Adversarial Networks have attracted the attention of researchers toward face image synthesis to generate ever more realistic images. Thereby, the need for the evaluation criteria to assess the realism of the generated images has become apparent. While FID utilized with InceptionV3 is one of the primary choices for benchmarking, concerns about InceptionV3's limitations for face images have emerged. This study investigates the behavior of diverse feature extractors -- InceptionV3, CLIP, DINOv2, and ArcFace -- considering a variety of metrics -- FID, KID, Precision&Recall. While the FFHQ dataset is used as the target domain, as the source domains, the CelebA-HQ dataset and the synthetic datasets generated using StyleGAN2 and Projected FastGAN are used. Experiments include deep-down analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to give valuable insights into the behavior of feature extractors for evaluating face image synthesis methodologies. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.

6/5/2024

Facial Image Feature Analysis and its Specialization for Fr'echet Distance and Neighborhoods

Doruk Cetin, Benedikt Schesch, Petar Stamenkovic, Niko Benjamin Huber, Fabio Zund, Majed El Helou

Assessing distances between images and image datasets is a fundamental task in vision-based research. It is a challenging open problem in the literature and despite the criticism it receives, the most ubiquitous method remains the Fr'echet Inception Distance. The Inception network is trained on a specific labeled dataset, ImageNet, which has caused the core of its criticism in the most recent research. Improvements were shown by moving to self-supervision learning over ImageNet, leaving the training data domain as an open question. We make that last leap and provide the first analysis on domain-specific feature training and its effects on feature distance, on the widely-researched facial image domain. We provide our findings and insights on this domain specialization for Fr'echet distance and image neighborhoods, supported by extensive experiments and in-depth user studies.

6/27/2024

🖼️

Using Skew to Assess the Quality of GAN-generated Image Features

Lorenzo Luzi, Helen Jenne, Ryan Murray, Carlos Ortiz Marrero

The rapid advancement of Generative Adversarial Networks (GANs) necessitates the need to robustly evaluate these models. Among the established evaluation criteria, the Fr'{e}chetInception Distance (FID) has been widely adopted due to its conceptual simplicity, fast computation time, and strong correlation with human perception. However, FID has inherent limitations, mainly stemming from its assumption that feature embeddings follow a Gaussian distribution, and therefore can be defined by their first two moments. As this does not hold in practice, in this paper we explore the importance of third-moments in image feature data and use this information to define a new measure, which we call the Skew Inception Distance (SID). We prove that SID is a pseudometric on probability distributions, show how it extends FID, and present a practical method for its computation. Our numerical experiments support that SID either tracks with FID or, in some cases, aligns more closely with human perception when evaluating image features of ImageNet data. Our work also shows that principal component analysis can be used to speed up the computation time of both FID and SID. Although we focus on using SID on image features for GAN evaluation, SID is applicable much more generally, including for the evaluation of other generative models.

5/1/2024