CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model

Read original: arXiv:2405.20764 - Published 6/13/2024 by Zhiming Meng, Hui Li, Zeyang Zhang, Zhongwei Shen, Yunlong Yu, Xiaoning Song, Xiaojun Wu

Overview

This paper proposes a new method called CoMoFusion for fusing infrared and visible images quickly and with high quality.
The key idea is to use a consistency model, a type of generative model built for few-step sampling, to extract features from both the infrared and visible inputs before fusing them.
The method targets fast inference, addressing the slow inference speed that the authors attribute to earlier generative fusion methods.

Plain English Explanation

In many situations, we need to combine information from different types of images to get a more complete understanding of a scene. For example, infrared images can show heat signatures that are invisible in regular visible light images, while visible light images provide more detailed color and texture information.

The process of merging these different types of images is called "image fusion." The CoMoFusion method aims to do this fusion quickly and accurately, producing a final image that captures the key details from both the infrared and visible inputs.

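The core idea is to use a "consistency model." Despite the name, this is not a constraint that keeps the fused image consistent with its sources. It is a type of generative model, related to diffusion models, that is trained so it can produce an output in one or a few steps instead of many. According to the paper's abstract, CoMoFusion uses such a model to extract features from the infrared and visible images, and a separate fusion module then combines those features. The aim is to preserve textures and salient content such as heat signatures, rather than just blending the two inputs together.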

CoMoFusion is designed for fast inference. This could make it attractive for time-sensitive applications like object detection or video processing, where fast image fusion matters.

Technical Explanation

The CoMoFusion method takes infrared and visible light images as input and produces a fused output image that combines the key details from both. The core innovation is the use of a consistency model, a generative model designed for fast sampling, in a field where generative fusion methods are described by the authors as unstable to train and slow at inference.

As described in the abstract, the method has three main parts:

Consistency Model: It is trained with a forward and reverse process to construct multi-modal joint features in the latent space.
Fusion Module: The infrared and visible features extracted by the trained consistency model are fed into this module, which generates the final fused image.
Pixel Value Selection Loss: A loss based on pixel value selection is used to enhance the texture and salient information of the fused image.

Because consistency models are built to generate in one or a few steps, this design avoids the long iterative sampling that slows down other generative approaches at inference time.
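For context, the term "consistency model" in the paper's abstract refers to a family of generative models that map a noisy input at any noise level directly back to a clean sample. The sketch below illustrates the standard skip/output parameterization used by such models in general, which guarantees that the model acts as the identity at the smallest noise level. It is not CoMoFusion's architecture: `toy_network`, `SIGMA_DATA`, `EPS`, and `t_max` are hypothetical stand-ins chosen for illustration, and the paper uses the trained model for feature extraction rather than for direct sampling as shown here.

```python
import numpy as np

SIGMA_DATA = 0.5   # assumed data standard deviation (illustrative value)
EPS = 0.002        # assumed smallest noise level (illustrative value)


def c_skip(t):
    # Weight on the noisy input; equals 1 when t == EPS.
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    # Weight on the network output; equals 0 when t == EPS.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)


def toy_network(x, t):
    # Hypothetical stand-in for a trained network F(x, t).
    return np.tanh(x) / (1.0 + t)


def consistency_fn(x, t):
    # f(x, t) = c_skip(t) * x + c_out(t) * F(x, t)
    # By construction f(x, EPS) == x, the boundary condition.
    return c_skip(t) * x + c_out(t) * toy_network(x, t)


def one_step_sample(shape, t_max=80.0, seed=0):
    # One network evaluation maps pure noise to an output,
    # in contrast to the many steps of iterative samplers.
    rng = np.random.default_rng(seed)
    x_t = rng.standard_normal(shape) * t_max
    return consistency_fn(x_t, t_max)


if __name__ == "__main__":
    x = np.linspace(-1.0, 1.0, 5)
    print(np.allclose(consistency_fn(x, EPS), x))
    print(one_step_sample((4, 4)).shape)
```

The single call in `one_step_sample` is the property that makes this model family attractive when inference speed matters.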

According to the authors, experiments on public datasets show that CoMoFusion achieves state-of-the-art fusion performance while keeping inference fast, which they contrast with the slow inference of earlier generative-model-based fusion methods.
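The paper's abstract (reproduced under Related Papers) also mentions a loss based on pixel value selection, without giving its formula. One plausible reading, shown below purely as an assumption, is to select the larger of the infrared and visible intensities at each pixel and penalize the fused image's distance from that target. The function name `pixel_selection_loss`, the use of a per-pixel maximum, and the L1 distance are all illustrative choices, and the loss actually used in the paper may differ.

```python
import numpy as np


def pixel_selection_loss(fused, ir, vis):
    # Assumed selection rule: keep the brighter source at each pixel.
    target = np.maximum(ir, vis)
    # Mean absolute difference between the fused image and the target.
    return float(np.mean(np.abs(fused - target)))


if __name__ == "__main__":
    ir = np.array([[0.9, 0.1], [0.2, 0.8]])
    vis = np.array([[0.3, 0.6], [0.7, 0.4]])

    blended = 0.5 * (ir + vis)      # naive averaging of the two inputs
    selected = np.maximum(ir, vis)  # per-pixel selection

    print(pixel_selection_loss(blended, ir, vis))   # penalized
    print(pixel_selection_loss(selected, ir, vis))  # zero by construction
```

Under this assumed rule, plain averaging is penalized because it dims whichever source was more salient at each pixel.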

Critical Analysis

The authors provide a thorough evaluation of CoMoFusion, demonstrating its advantages over previous fusion techniques. However, some potential limitations or areas for further research are worth considering:

The paper focuses on fusing just infrared and visible light images. It would be interesting to see how the consistency model approach extends to fusing a wider range of image modalities.
The experiments are conducted on fairly constrained datasets. Evaluating performance on more diverse, real-world scenes could uncover additional challenges or tradeoffs.
While the computational efficiency of CoMoFusion is a key strength, the authors do not provide much detail on the actual inference speeds or hardware requirements. More concrete benchmarking could help users understand the practical deployment considerations.
The paper does not discuss potential failure cases or edge cases where the consistency model approach might break down. Exploring the robustness of the method would be a valuable area for future work.
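To make the benchmarking point above concrete, the following is a minimal sketch of how inference latency could be measured for any fusion function. It does not run CoMoFusion: `fuse_stub` is a hypothetical placeholder for a real model call, and the image size, warm-up count, and run count are arbitrary assumptions.

```python
import time

import numpy as np


def fuse_stub(ir, vis):
    # Hypothetical placeholder for a real fusion model's forward pass.
    return np.maximum(ir, vis)


def benchmark(fn, ir, vis, warmup=3, runs=20):
    # Warm-up calls so one-time setup costs are not measured.
    for _ in range(warmup):
        fn(ir, vis)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(ir, vis)
        timings.append(time.perf_counter() - start)
    return {
        "median_ms": 1000.0 * float(np.median(timings)),
        "p95_ms": 1000.0 * float(np.percentile(timings, 95)),
        "runs": runs,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir = rng.random((480, 640))
    vis = rng.random((480, 640))
    print(benchmark(fuse_stub, ir, vis))
```

When timing a model on a GPU, the device would also need to be synchronized before each clock read, since kernel launches are typically asynchronous. Reporting the hardware, input resolution, and batch size alongside the numbers makes results comparable.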

Overall, CoMoFusion represents a promising advance in efficient and high-quality image fusion. Further research building on this foundation could lead to even more powerful and versatile multi-modal fusion capabilities.

Conclusion

The CoMoFusion method introduced in this paper offers a new approach to fusing infrared and visible light images quickly and with high quality. By using a consistency model to extract infrared and visible features and a dedicated fusion module to combine them, CoMoFusion is reported to outperform previous fusion techniques in fusion quality while keeping inference fast.

This efficiency makes CoMoFusion well-suited for real-time applications that require fast image fusion, such as object detection, surveillance, and autonomous navigation. As the field of multi-modal sensing continues to advance, methods like CoMoFusion will become increasingly valuable for extracting the full breadth of information from diverse imaging modalities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model

Zhiming Meng, Hui Li, Zeyang Zhang, Zhongwei Shen, Yunlong Yu, Xiaoning Song, Xiaojun Wu

Generative models are widely utilized to model the distribution of fused images in the field of infrared and visible image fusion. However, current generative models based fusion methods often suffer from unstable training and slow inference speed. To tackle this problem, a novel fusion method based on consistency model is proposed, termed as CoMoFusion, which can generate the high-quality images and achieve fast image inference speed. In specific, the consistency model is used to construct multi-modal joint features in the latent space with the forward and reverse process. Then, the infrared and visible features extracted by the trained consistency model are fed into fusion module to generate the final fused image. In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed. Extensive experiments on public datasets illustrate that our method obtains the SOTA fusion performance compared with the existing fusion methods.

6/13/2024

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma

Multi-modality image fusion aims at fusing specific-modality and shared-modality information from two source images. To tackle the problem of insufficient feature extraction and lack of semantic awareness for complex scenes, this paper focuses on how to model correlation-driven decomposing features and reason high-level graph representation by efficiently extracting complementary features and multi-guided feature aggregation. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. The transformer with Multi-Dconv Transposed Attention and Local-enhanced Feed Forward network is used to extract shallow features after the depthwise convolution. In the three parallel branches encoder, Cross Attention and Invertible Block (CAI) enables to extract local features and preserve high-frequency texture details. Base feature extraction module (BFE) with residual connections can capture long-range dependency and enhance shared-modality expression capabilities. Graph Reasoning Module (GR) is introduced to reason high-level cross-modality relations and extract low-level details features as CAI's specific-modality complementary information simultaneously. Experiments demonstrate that our method has obtained competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks. Moreover, we surpass other fusion methods in terms of subsequent tasks, averagely scoring 9.78% mAP@.5 higher in object detection and 6.46% mIoU higher in semantic segmentation.

7/9/2024

Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model

Yijia Chen, Pinghua Chen, Xiangxin Zhou, Yingtie Lei, Ziyang Zhou, Mingxian Li

In the field of computer vision, visible light images often exhibit low contrast in low-light conditions, presenting a significant challenge. While infrared imagery provides a potential solution, its utilization entails high costs and practical limitations. Recent advancements in deep learning, particularly the deployment of Generative Adversarial Networks (GANs), have facilitated the transformation of visible light images to infrared images. However, these methods often experience unstable training phases and may produce suboptimal outputs. To address these issues, we propose a novel end-to-end Transformer-based model that efficiently converts visible light images into high-fidelity infrared images. Initially, the Texture Mapping Module and Color Perception Adapter collaborate to extract texture and color features from the visible light image. The Dynamic Fusion Aggregation Module subsequently integrates these features. Finally, the transformation into an infrared image is refined through the synergistic action of the Color Perception Adapter and the Enhanced Perception Attention mechanism. Comprehensive benchmarking experiments confirm that our model outperforms existing methods, producing infrared images of markedly superior quality, both qualitatively and quantitatively. Furthermore, the proposed model enables more effective downstream applications for infrared images than other methods.

4/30/2024

SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images

Ming Chen, Yuxuan Cheng, Xinwei He, Xinyue Wang, Yan Aze, Jinhai Xiang

Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fusion paradigm, where the visible and the infrared images are decomposed into reflectance and illumination components via Retinex theory and followed by the fusion of these corresponding elements. The whole framework is designed with two plain convolutional neural networks without downsampling, which can perform image decomposition and fusion efficiently. Moreover, we introduce decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on the challenging benchmarks, verifying the superiority of our method over previous state-of-the-arts. Code is available at https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images

6/28/2024