Similarity Metrics for MR Image-To-Image Translation

2405.08431

Published 6/19/2024 by Melanie Dohmen, Mark Klemens, Ivo Baltruschat, Tuan Truong, Matthias Lenga

Abstract

Image-to-image translation can create large impact in medical imaging, for instance the possibility to synthetically transform images to other modalities, sequence types, higher resolutions or lower noise levels. In order to assure a high level of patient safety, these methods are mostly validated by human reader studies, which require a considerable amount of time and costs. Quantitative metrics have been used to complement such studies and to provide reproducible and objective assessment of synthetic images. Even though the SSIM and PSNR metrics are extensively used, they do not detect all types of errors in synthetic images as desired. Other metrics could provide additional useful evaluation. In this study, we give an overview and a quantitative analysis of 15 metrics for assessing the quality of synthetically generated images. We include 11 full-reference metrics (SSIM, MS-SSIM, CW-SSIM, PSNR, MSE, NMSE, MAE, LPIPS, DISTS, NMI and PCC), three non-reference metrics (BLUR, MLC, MSLC) and one downstream task segmentation metric (DICE) to detect 11 kinds of typical distortions and artifacts that occur in MR images. In addition, we analyze the influence of four prominent normalization methods (Minmax, cMinmax, Zscore and Quantile) on the different metrics and distortions. Finally, we provide adverse examples to highlight pitfalls in metric assessment and derive recommendations for effective usage of the analyzed similarity metrics for evaluation of image-to-image translation models.

Overview

Image-to-image translation can have a significant impact in medical imaging, as it can help create new images from existing ones to aid diagnosis.
However, these translation methods need to be validated through human reader studies, which are costly and limited in sample size.
Automatic evaluation of large samples is needed to pre-evaluate and continuously improve these methods before human validation.
This study examines various reference and non-reference metrics for assessing the quality of synthesized medical images, focusing on their ability to detect different types of distortions in MRI data.

Plain English Explanation

Medical imaging techniques like MRI can generate a wealth of information about a patient's health. Image-to-image translation is a powerful tool that can transform one type of medical image into another, potentially providing doctors with new perspectives and insights to improve diagnosis.

For example, if an MRI scan could be "translated" into a different type of image, it might reveal features that were previously hidden or difficult to see. This could be very valuable for medical professionals, as it could lead to earlier detection of diseases or more accurate treatment plans.

However, before these translation methods can be used in real-world medical settings, they need to be thoroughly tested and validated. This often involves having human experts, like radiologists, review the translated images and provide their assessment. Unfortunately, these human reader studies can be very time-consuming and expensive, and they're typically limited to small sample sizes.

To overcome this challenge, the researchers in this study explored the idea of using automated evaluation metrics to assess the quality of the translated images. By testing a variety of different metrics, they aimed to identify the ones that could most effectively detect various types of distortions or errors in the synthesized MRI data. This could allow researchers to quickly evaluate and improve their translation models before the final, costly human validation step.

Technical Explanation

The researchers investigated 11 full-reference metrics (SSIM, MS-SSIM, CW-SSIM, PSNR, MSE, NMSE, MAE, LPIPS, DISTS, NMI, and PCC), three non-reference metrics (BLUR, MLC, and MSLC), and one downstream segmentation metric (DICE) for detecting 11 types of distortions in MR images from the BraSyn dataset. They also tested the influence of four normalization methods (Minmax, cMinmax, Zscore, and Quantile) on the metrics.
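To make the normalization step concrete, here is a minimal NumPy sketch of min-max and z-score scaling using their textbook definitions. The `clipped_minmax` helper and its percentile defaults are hypothetical: they show one plausible way to clip before rescaling and should not be read as the paper's definition of cMinmax or Quantile. The synthetic `scan` array is also an assumption made for illustration.

```python
import numpy as np


def minmax(img):
    """Rescale intensities linearly so the minimum maps to 0 and the maximum to 1."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def zscore(img):
    """Shift to zero mean and scale to unit standard deviation."""
    img = np.asarray(img, dtype=np.float64)
    std = img.std()
    if std == 0:
        return np.zeros_like(img)
    return (img - img.mean()) / std


def clipped_minmax(img, lower_pct=1.0, upper_pct=99.0):
    """Hypothetical clipped variant: clip to percentiles, then min-max scale.

    The percentile defaults are illustrative assumptions, not values from the paper.
    """
    img = np.asarray(img, dtype=np.float64)
    lo, hi = np.percentile(img, [lower_pct, upper_pct])
    return minmax(np.clip(img, lo, hi))


# Synthetic stand-in for an MR slice in arbitrary units, with one bright outlier.
rng = np.random.default_rng(0)
scan = rng.gamma(shape=2.0, scale=150.0, size=(64, 64))
scan[0, 0] = 10_000.0

median_plain = float(np.median(minmax(scan)))
median_clipped = float(np.median(clipped_minmax(scan)))
print(f"median after minmax: {median_plain:.3f}, after clipped minmax: {median_clipped:.3f}")
```

A single outlier voxel compresses almost all tissue intensities toward zero under plain min-max scaling, whereas the clipped variant spreads them over more of the output range. This is one reason the choice of normalization can change what a metric reports.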

The results showed that commonly used metrics like PSNR and SSIM have significant limitations when it comes to evaluating generative models for medical image-to-image translation tasks. For example, SSIM was found to be very sensitive to intensity shifts in unnormalized MRI images, while ignoring issues like blurriness. PSNR, on the other hand, was highly dependent on the normalization method used and did not effectively measure the degree of distortions.
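One reason PSNR depends on preprocessing is visible in its standard definition, PSNR = 10 * log10(L^2 / MSE), where L is the assumed intensity range. The sketch below computes PSNR for the same image pair under two choices of L. The synthetic images and the 12-bit range of 4095 are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np


def psnr(reference, image, data_range):
    """Peak signal-to-noise ratio in dB for a given assumed intensity range."""
    reference = np.asarray(reference, dtype=np.float64)
    image = np.asarray(image, dtype=np.float64)
    mse = np.mean((reference - image) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(data_range ** 2 / mse))


# Synthetic reference in arbitrary units plus additive Gaussian noise.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1000.0, size=(64, 64))
distorted = reference + rng.normal(0.0, 20.0, size=reference.shape)

observed_range = float(reference.max() - reference.min())
psnr_observed_range = psnr(reference, distorted, data_range=observed_range)
psnr_12bit_range = psnr(reference, distorted, data_range=4095.0)  # assumed 12-bit range

gap_db = psnr_12bit_range - psnr_observed_range
print(f"PSNR gap between the two range conventions: {gap_db:.2f} dB")
```

The pixel errors are identical in both calls, yet the reported score differs by 20 * log10(4095 / observed_range) dB. PSNR values are therefore only comparable when the normalization and the assumed data range are reported alongside them.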

Other metrics, such as LPIPS, NMI, and DICE, were found to be more useful for evaluating different aspects of image similarity. However, the researchers note that most of the tested metrics become unreliable when the images being compared are misaligned.
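The effect of misalignment can be illustrated with a small sketch that shifts a mask and an image by a single pixel. The Dice and Pearson correlation functions use their standard definitions. The toy mask, the noise image, and the convention that two empty masks score 1.0 are assumptions made for this example.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|) for binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return float(2.0 * np.logical_and(a, b).sum() / total)


def pcc(x, y):
    """Pearson correlation coefficient between two images, computed over all pixels."""
    return float(np.corrcoef(np.ravel(x), np.ravel(y))[0, 1])


# A 4x4 square mask and the same mask shifted right by one pixel.
mask = np.zeros((10, 10), dtype=bool)
mask[2:6, 2:6] = True
shifted_mask = np.roll(mask, 1, axis=1)

# A noise image and the same image shifted right by one pixel (with wrap-around).
rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
shifted_image = np.roll(image, 1, axis=1)

dice_shifted = dice(mask, shifted_mask)
pcc_same = pcc(image, image)
pcc_shifted = pcc(image, shifted_image)
print(f"Dice after 1-pixel shift: {dice_shifted}, PCC after 1-pixel shift: {pcc_shifted:.3f}")
```

Here the shifted mask overlaps the original in 12 of 16 pixels, so Dice falls from 1.0 to 0.75 although the shape is unchanged. The pixel-wise correlation of the noise image drops far more sharply. Real anatomy is smoother than noise, so the drop would be less extreme, but the direction of the effect is the same.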

The key takeaway is that by carefully selecting and combining multiple image similarity metrics, researchers can improve the training and selection of generative models for medical image synthesis. This can help validate the models' outputs before the final, costly evaluation by human experts.
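As a rough sketch of what reporting several metrics together could look like, the hypothetical helper `metric_report` below computes a few simple full-reference metrics for one image pair. The function name, the selection of metrics, and the NMSE convention (squared error divided by the reference energy) are assumptions for illustration and not the paper's evaluation pipeline.

```python
import numpy as np


def metric_report(reference, image, data_range):
    """Return several simple full-reference metrics for one image pair."""
    ref = np.asarray(reference, dtype=np.float64)
    img = np.asarray(image, dtype=np.float64)
    diff = ref - img
    mse = float(np.mean(diff ** 2))
    ref_energy = float(np.sum(ref ** 2))
    return {
        "MAE": float(np.mean(np.abs(diff))),
        "MSE": mse,
        # Assumed convention: error energy relative to reference energy.
        "NMSE": float(np.sum(diff ** 2)) / ref_energy if ref_energy > 0 else float("nan"),
        "PSNR": float("inf") if mse == 0 else float(10.0 * np.log10(data_range ** 2 / mse)),
        "PCC": float(np.corrcoef(ref.ravel(), img.ravel())[0, 1]),
    }


# A reference image and a copy with a constant brightness offset.
rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=(64, 64))
brighter = reference + 0.1

report = metric_report(reference, brighter, data_range=1.0)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In this example PCC stays at 1 because correlation is invariant to a constant offset, while MAE reports the full 0.1 shift. Reading the two together shows that the structure is preserved but the intensities are not, which neither metric reveals on its own.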

Critical Analysis

The researchers provide a comprehensive evaluation of various reference and non-reference metrics for assessing the quality of synthesized medical images. However, the study is limited to the BraSyn dataset and a specific set of distortions. It would be valuable to see how the metrics perform on a wider range of medical imaging modalities and distortion types.

Additionally, the paper does not address the issue of computational complexity and runtime for the different metrics. In a real-world setting, the speed and efficiency of the evaluation process may be just as important as its accuracy.

It would also be interesting to explore the potential of multi-view evaluation approaches or semantic-based metrics that could provide a more holistic assessment of the translated images' clinical relevance and usefulness.

Overall, this study provides valuable insights into the strengths and limitations of different image quality metrics for medical image translation tasks. However, more research is needed to develop a comprehensive, reliable, and efficient evaluation framework that can fully support the development and deployment of these powerful techniques in real-world clinical settings.

Conclusion

This study presents a detailed evaluation of various reference and non-reference metrics for assessing the quality of synthesized medical images, with a focus on their ability to detect different types of distortions in MRI data. The findings reveal that commonly used metrics like PSNR and SSIM have significant limitations when it comes to evaluating generative models for medical image-to-image translation tasks.

By carefully selecting and combining multiple image similarity metrics, researchers can improve the training and selection of these generative models, allowing them to validate the models' outputs before the final, costly evaluation by human experts. This could lead to more efficient and effective development of image-to-image translation techniques that can have a significant impact on medical imaging and diagnosis.

Related Papers

Rethinking Perceptual Metrics for Medical Image Translation

Nicholas Konz, Yuwen Chen, Hanxue Gu, Haoyu Dong, Maciej A. Mazurowski

Modern medical image translation methods use generative models for tasks such as the conversion of CT images to MRI. Evaluating these methods typically relies on some chosen downstream task in the target domain, such as segmentation. On the other hand, task-agnostic metrics are attractive, such as the network feature-based perceptual metrics (e.g., FID) that are common to image translation in general computer vision. In this paper, we investigate evaluation metrics for medical image translation on two medical image translation tasks (GE breast MRI to Siemens breast MRI and lumbar spine MRI to CT), tested on various state-of-the-art translation methods. We show that perceptual metrics do not generally correlate with segmentation metrics due to them extending poorly to the anatomical constraints of this sub-field, with FID being especially inconsistent. However, we find that the lesser-used pixel-level SWD metric may be useful for subtle intra-modality translation. Our results demonstrate the need for further research into helpful metrics for medical image translation.

4/12/2024

eess.IV cs.CV cs.LG

Do High-Performance Image-to-Image Translation Networks Enable the Discovery of Radiomic Features? Application to MRI Synthesis from Ultrasound in Prostate Cancer

Mohammad R. Salmanpour, Amin Mousavi, Yixi Xu, William B Weeks, Ilker Hacihaliloglu

This study investigates the foundational characteristics of image-to-image translation networks, specifically examining their suitability and transferability within the context of routine clinical environments, despite achieving high levels of performance, as indicated by a Structural Similarity Index (SSIM) exceeding 0.95. The evaluation study was conducted using data from 794 patients diagnosed with Prostate cancer. To synthesize MRI from Ultrasound images, we employed five widely recognized image to image translation networks in medical imaging: 2DPix2Pix, 2DCycleGAN, 3DCycleGAN, 3DUNET, and 3DAutoEncoder. For quantitative assessment, we report four prevalent evaluation metrics Mean Absolute Error, Mean Square Error, Structural Similarity Index (SSIM), and Peak Signal to Noise Ratio. Moreover, a complementary analysis employing Radiomic features (RF) via Spearman correlation coefficient was conducted to investigate, for the first time, whether networks achieving high performance, SSIM greater than 0.85, could identify low-level RFs. The RF analysis showed 75 features out of 186 RFs were discovered via just 2DPix2Pix algorithm while half of RFs were lost in the translation process. Finally, a detailed qualitative assessment by five medical doctors indicated a lack of low level feature discovery in image to image translation tasks.

6/25/2024

eess.IV

Semantic Similarity Score for Measuring Visual Similarity at Semantic Level

Senran Fan, Zhicheng Bao, Chen Dong, Haotai Liang, Xiaodong Xu, Ping Zhang

Semantic communication, as a revolutionary communication architecture, is considered a promising novel communication paradigm. Unlike traditional symbol-based error-free communication systems, semantic-based visual communication systems extract, compress, transmit, and reconstruct images at the semantic level. However, widely used image similarity evaluation metrics, whether pixel-based MSE or PSNR or structure-based MS-SSIM, struggle to accurately measure the loss of semantic-level information of the source during system transmission. This presents challenges in evaluating the performance of visual semantic communication systems, especially when comparing them with traditional communication systems. To address this, we propose a semantic evaluation metric -- SeSS (Semantic Similarity Score), based on Scene Graph Generation and graph matching, which shifts the similarity scores between images into semantic-level graph matching scores. Meanwhile, semantic similarity scores for tens of thousands of image pairs are manually annotated to fine-tune the hyperparameters in the graph matching algorithm, aligning the metric more closely with human semantic perception. The performance of the SeSS is tested on different datasets, including (1)images transmitted by traditional and semantic communication systems at different compression rates, (2)images transmitted by traditional and semantic communication systems at different signal-to-noise ratios, (3)images generated by large-scale model with different noise levels introduced, and (4)cases of images subjected to certain special transformations. The experiments demonstrate the effectiveness of SeSS, indicating that the metric can measure the semantic-level differences in semantic-level information of images and can be used for evaluation in visual semantic communication systems.

6/7/2024

cs.CV cs.AI

A study on the adequacy of common IQA measures for medical images

Anna Breger, Clemens Karner, Ian Selby, Janek Grohl, Soren Dittmer, Edward Lilley, Judith Babar, Jake Beckford, Timothy J Sadler, Shahab Shahipasand, Arthikkaa Thavakumar, Michael Roberts, Carola-Bibiane Schonlieb

Image quality assessment (IQA) is standard practice in the development stage of novel machine learning algorithms that operate on images. The most commonly used IQA measures have been developed and tested for natural images, but not in the medical setting. Reported inconsistencies arising in medical images are not surprising, as they have different properties than natural images. In this study, we test the applicability of common IQA measures for medical image data by comparing their assessment to manually rated chest X-ray (5 experts) and photoacoustic image data (1 expert). Moreover, we include supplementary studies on grayscale natural images and accelerated brain MRI data. The results of all experiments show a similar outcome in line with previous findings for medical imaging: PSNR and SSIM in the default setting are in the lower range of the result list and HaarPSI outperforms the other tested measures in the overall performance. Also among the top performers in our medical experiments are the full reference measures DISTS, FSIM, LPIPS and MS-SSIM. Generally, the results on natural images yield considerably higher correlations, suggesting that the additional employment of tailored IQA measures for medical imaging algorithms is needed.

5/30/2024

eess.IV cs.CV