S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion

Read original: arXiv:2405.20881 - Published 6/4/2024 by Haolong Ma, Hui Li, Chunyang Cheng, Gaoang Wang, Xiaoning Song, Xiaojun Wu

Overview

Presents a new saliency-aware image fusion method called S4Fusion for combining infrared and visible images
Proposes a selective state space model that adaptively fuses salient features from the input images
Aims to improve the quality and accuracy of fused images for applications like surveillance and autonomous driving

Plain English Explanation

S4Fusion is a new technique for combining infrared and visible light images to create a single, high-quality fused image. Infrared cameras can detect heat and are useful for identifying people or objects in low-light conditions, while visible light cameras capture color information that is familiar to the human eye. By fusing these two types of images, S4Fusion can produce a result that highlights important details from both sources.

The key innovation of S4Fusion is its "selective state space model," which adaptively selects and fuses the most salient, or visually important, features from the input images. This helps ensure that the final fused image retains the critical information from each modality, rather than simply blending the two together. The model is designed to work well for applications like surveillance, autonomous vehicles, and other scenarios where accurate and informative fused imagery is crucial.

Technical Explanation

S4Fusion uses a selective state space model (SSSM) to fuse infrared and visible light images. This model adaptively chooses which features from each input image to emphasize in the final fused output. It is related to FusionMamba, which also applies state space modeling to image fusion, and to CoMoFusion, which targets the same infrared and visible fusion task but builds on a consistency model rather than a state space model.
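To make the term "selective state space model" more concrete, the following is a minimal NumPy sketch of a generic Mamba-style selective scan over a 1-D sequence. It is not the S4Fusion architecture: the function name `selective_scan`, the projection matrices `W_B`, `W_C`, `W_delta`, and all shapes are assumptions chosen for illustration. The point it shows is that the step size and the input/output projections are computed from the current input, which is what lets the recurrence decide per position how much to keep or overwrite.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_delta):
    """Generic selective state space scan (illustrative, not the paper's model).

    x:       (T, D) input sequence, e.g. flattened image features
    A:       (D, N) state decay parameters, expected to be negative
    W_B:     (D, N) hypothetical projection giving the input-dependent B_t
    W_C:     (D, N) hypothetical projection giving the input-dependent C_t
    W_delta: (D, D) hypothetical projection giving the input-dependent step size
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))           # hidden state, one N-dim state per channel
    y = np.empty((T, D))
    for t in range(T):
        # Softplus keeps the step size positive; logaddexp is numerically stable.
        delta = np.logaddexp(0.0, x[t] @ W_delta)    # (D,)
        B_t = x[t] @ W_B                             # (N,)
        C_t = x[t] @ W_C                             # (N,)
        A_bar = np.exp(delta[:, None] * A)           # (D, N) discretized decay
        B_bar = delta[:, None] * B_t[None, :]        # (D, N) discretized input gate
        h = A_bar * h + B_bar * x[t][:, None]        # state update
        y[t] = h @ C_t                               # (D,) readout
    return y

# Toy usage with random parameters (values are arbitrary).
rng = np.random.default_rng(0)
T, D, N = 6, 3, 4
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
W_delta = rng.normal(size=(D, D))
y = selective_scan(x, A, W_B, W_C, W_delta)
print(y.shape)
```

The loop runs once per position, so the cost grows linearly with sequence length, and each output depends only on the current and earlier inputs.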

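The key innovation of S4Fusion is its saliency awareness: the model identifies the most visually salient regions in the input images and fuses them selectively, rather than simply combining the entire input images. The abstract attributes this to two components. A Cross-Modal Spatial Awareness Module (CMSA) attends to global spatial information from both modalities at once while letting them interact, and a pre-trained network perceives uncertainty in the fused image; minimizing that uncertainty drives the model to highlight salient targets from both inputs. This helps preserve critical details that may be lost in a more naive fusion approach.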
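The difference between uniform blending and saliency-guided fusion can be illustrated with a toy example. The sketch below assumes that per-pixel saliency maps are already available as arrays; the function names and the normalized-weight rule are illustrative assumptions, not the mechanism used in S4Fusion, which learns its fusion end to end.

```python
import numpy as np

def average_fusion(ir, vis):
    """Naive baseline: blend both images equally at every pixel."""
    return 0.5 * (ir + vis)

def saliency_weighted_fusion(ir, vis, sal_ir, sal_vis, eps=1e-8):
    """Weight each pixel by relative saliency (hypothetical rule for illustration).

    sal_ir and sal_vis are non-negative saliency maps with the same shape as
    the images. Where both are zero, the weights fall back to 0.5 each.
    """
    w_ir = (sal_ir + eps) / (sal_ir + sal_vis + 2.0 * eps)
    return w_ir * ir + (1.0 - w_ir) * vis

# Toy 2x2 images: the infrared image has one bright (hot) target at [0, 1].
ir = np.array([[0.1, 0.9], [0.1, 0.1]])
vis = np.array([[0.6, 0.2], [0.5, 0.7]])
sal_ir = np.array([[0.0, 1.0], [0.0, 0.0]])   # only the hot target is salient
sal_vis = np.array([[1.0, 0.0], [1.0, 1.0]])  # visible texture elsewhere

print(average_fusion(ir, vis))
print(saliency_weighted_fusion(ir, vis, sal_ir, sal_vis))
```

In the averaged result the hot target is pulled down to 0.55, while the saliency-weighted result keeps it near 0.9 and keeps the visible-light values elsewhere.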

The S4Fusion architecture consists of several stages. First, saliency maps are generated for the infrared and visible light images using a deep learning-based saliency detection model. Next, the input images and their corresponding saliency maps are fed into the selective state space fusion model, which adaptively combines the most salient features. Finally, post-processing steps are applied to refine the fused output.

The authors evaluate S4Fusion on several infrared-visible image fusion datasets and compare its performance to state-of-the-art methods like Coupled-MAMBA, Fusion-MAMBA, and MMA-UNet. According to the abstract, the experiments show that S4Fusion produces high-quality fused images and improves performance in downstream tasks, making it a promising approach for real-world applications that require reliable multimodal image fusion.
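This summary does not say which metrics the evaluation uses. As one example of how fusion quality is often quantified in the image fusion literature, the sketch below computes the Shannon entropy of a fused image's intensity histogram, where higher values indicate a richer spread of intensities. Whether the paper reports this particular metric is an assumption, and the function name `image_entropy` is hypothetical.

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy in bits of an 8-bit grayscale image's histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()      # empirical intensity distribution
    p = p[p > 0]               # skip empty bins, since 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# A flat image carries no information; a half-black, half-white one carries 1 bit.
flat = np.full((4, 4), 128, dtype=np.uint8)
half = np.array([[0, 0, 255, 255]] * 4, dtype=np.uint8)
print(image_entropy(flat), image_entropy(half))
```

Entropy alone cannot tell useful detail from noise, so it is normally read alongside other measures and visual inspection.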

Critical Analysis

The S4Fusion paper presents a well-designed and empirically validated method for fusing infrared and visible light images. The inclusion of saliency detection is a clever and effective way to ensure that the most important features from each modality are preserved in the final fused output.

However, the paper leaves a few potential limitations unaddressed. For example, S4Fusion relies on a pre-trained network to guide which targets are treated as salient, so it may not generalize well to all types of infrared and visible light imagery. Additionally, the selective fusion process could discard important details that the model does not deem salient, which could be problematic for certain applications.

It would also be valuable to see the authors explore the computational efficiency and real-time performance of S4Fusion, as these factors are crucial for many practical use cases like autonomous vehicles and surveillance systems. Further research could also investigate the robustness of S4Fusion to noise, occlusions, or other real-world challenges.

Overall, S4Fusion represents an important step forward in multimodal image fusion, and the authors' thoughtful incorporation of saliency-awareness is a valuable contribution to the field. However, as with any research, there are opportunities for continued refinement and expansion to address the remaining challenges and limitations.

Conclusion

The S4Fusion paper presents a novel saliency-aware image fusion method that combines infrared and visible light images in a selective and adaptive manner. By leveraging saliency detection to identify the most important features in each input modality, S4Fusion is able to produce high-quality fused outputs that preserve critical details for applications like surveillance, autonomous driving, and other scenarios where accurate multimodal imaging is essential.

The technical innovations and strong empirical results demonstrated in this paper make S4Fusion a promising approach for multimodal image fusion. As the authors continue to refine and expand the method, it could have a significant impact on a wide range of real-world computer vision and imaging applications.

Related Papers

S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion

Haolong Ma, Hui Li, Chunyang Cheng, Gaoang Wang, Xiaoning Song, Xiaojun Wu

As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM in capturing the global spatial information of both modalities. This limitation prevents the simultaneous consideration of the global spatial information from both modalities during interaction, leading to a lack of comprehensive perception of salient targets. Consequently, the fusion results tend to bias towards one modality instead of adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously focus on global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.

6/4/2024

FusionMamba: Efficient Image Fusion with State Space Model

Siran Peng, Xiangyu Zhu, Haoyu Deng, Zhen Lei, Liang-Jian Deng

Image fusion aims to generate a high-resolution multi/hyper-spectral image by combining a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Current deep learning (DL)-based methods for image fusion primarily rely on CNNs or Transformers to extract features and merge different types of data. While CNNs are efficient, their receptive fields are limited, restricting their capacity to capture global context. Conversely, Transformers excel at learning global information but are hindered by their quadratic complexity. Fortunately, recent advancements in the State Space Model (SSM), particularly Mamba, offer a promising solution to this issue by enabling global awareness with linear complexity. However, there have been few attempts to explore the potential of the SSM in information fusion, which is a crucial ability in domains like image fusion. Therefore, we propose FusionMamba, an innovative method for efficient image fusion. Our contributions mainly focus on two aspects. Firstly, recognizing that images from different sources possess distinct properties, we incorporate Mamba blocks into two U-shaped networks, presenting a novel architecture that extracts spatial and spectral features in an efficient, independent, and hierarchical manner. Secondly, to effectively combine spatial and spectral information, we extend the Mamba block to accommodate dual inputs. This expansion leads to the creation of a new module called the FusionMamba block, which outperforms existing fusion techniques such as concatenation and cross-attention. We conduct a series of experiments on five datasets related to three image fusion tasks. The quantitative and qualitative evaluation results demonstrate that our method achieves SOTA performance, underscoring the superiority of FusionMamba. The code is available at https://github.com/PSRben/FusionMamba.

5/14/2024

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

Hui Li, Xiao-Jun Wu

Multimodal visual information fusion aims to integrate the multi-sensor data into a single image which contains more complementary information and less redundant features. However the complementary information is hard to extract, especially for infrared and visible images which contain big similarity gap between these two modalities. The common cross attention modules only consider the correlation, on the contrary, image fusion tasks need focus on complementarity (uncorrelation). Hence, in this paper, a novel cross attention mechanism (CAM) is proposed to enhance the complementary information. Furthermore, a two-stage training strategy based fusion scheme is presented to generate the fused images. For the first stage, two auto-encoder networks with same architecture are trained for each modality. Then, with the fixed encoders, the CAM and a decoder are trained in the second stage. With the trained CAM, features extracted from two modalities are integrated into one fused feature in which the complementary information is enhanced and the redundant features are reduced. Finally, the fused image can be generated by the trained decoder. The experimental results illustrate that our proposed fusion method obtains the SOTA fusion performance compared with the existing fusion networks. The codes are available at https://github.com/hli1221/CrossFuse

6/18/2024

CoMoFusion: Fast and High-quality Fusion of Infrared and Visible Image with Consistency Model

Zhiming Meng, Hui Li, Zeyang Zhang, Zhongwei Shen, Yunlong Yu, Xiaoning Song, Xiaojun Wu

Generative models are widely utilized to model the distribution of fused images in the field of infrared and visible image fusion. However, current generative models based fusion methods often suffer from unstable training and slow inference speed. To tackle this problem, a novel fusion method based on consistency model is proposed, termed as CoMoFusion, which can generate the high-quality images and achieve fast image inference speed. In specific, the consistency model is used to construct multi-modal joint features in the latent space with the forward and reverse process. Then, the infrared and visible features extracted by the trained consistency model are fed into fusion module to generate the final fused image. In order to enhance the texture and salient information of fused images, a novel loss based on pixel value selection is also designed. Extensive experiments on public datasets illustrate that our method obtains the SOTA fusion performance compared with the existing fusion methods.

6/13/2024