A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Read original: arXiv:2407.06159 - Published 7/9/2024 by Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma


Overview

Proposes a semantic-aware and multi-guided network for fusing infrared and visible images
Uses feature aggregation, auto-encoder, graph neural network, and invertible neural network techniques
Aims to improve the quality and semantic consistency of fused images

Plain English Explanation

This paper presents a new approach for combining infrared and visible-light images to create a fused output that is both high-quality and semantically consistent. The key idea is to use multiple "guides" or techniques to extract and integrate relevant features from the input images.

One guide is a feature aggregation module that combines low-level details. Another is an auto-encoder that learns a compact representation of the images. A graph neural network is used to model relationships between semantic features, and an invertible neural network helps preserve the structure of the input images.

By combining these complementary techniques, the method fuses infrared and visible images while retaining important details and maintaining semantic consistency.

Technical Explanation

The proposed network consists of several key components:

Feature Aggregation Module: This module uses cross-attention to selectively combine low-level features from the infrared and visible images, focusing on the most relevant details.
Auto-Encoder: An auto-encoder is used to learn a compact, semantic representation of the input images. This helps the network understand the high-level content and context.
Graph Neural Network: A graph neural network models the relationships between semantic features, allowing the network to capture and preserve important spatial and semantic information.
Invertible Neural Network: An invertible neural network is employed to ensure that the structure and details of the input images are maintained in the fused output.
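
As a rough illustration of two of the building blocks named above, the following sketch shows scaled dot-product cross-attention (queries from one modality, keys and values from the other) and an additive coupling step, which is one common way to build an invertible layer. It is not the authors' implementation: the function names, the toy feature vectors, and the choice of additive coupling are assumptions made for illustration, and the code uses plain Python lists rather than a deep-learning framework.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention where queries come from one modality
    (e.g. infrared tokens) and keys/values from the other (e.g. visible)."""
    scale = math.sqrt(len(queries[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the other modality's values.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def coupling_forward(x1: Vector, x2: Vector,
                     f: Callable[[Vector], Vector],
                     g: Callable[[Vector], Vector]) -> Tuple[Vector, Vector]:
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def coupling_inverse(y1: Vector, y2: Vector,
                     f: Callable[[Vector], Vector],
                     g: Callable[[Vector], Vector]) -> Tuple[Vector, Vector]:
    """Inverse of coupling_forward; works for any f and g (up to rounding)."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


# Hypothetical toy features: 2 infrared tokens, 3 visible tokens, dimension 2.
infrared = [[1.0, 0.0], [0.0, 1.0]]
visible = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = cross_attention(infrared, visible, visible)

# Any functions work as f and g; these stand in for small learned sub-networks.
f = lambda v: [math.tanh(a) for a in v]
g = lambda v: [0.5 * a for a in v]
y1, y2 = coupling_forward([0.3, -1.2], [2.0, 0.7], f, g)
x1, x2 = coupling_inverse(y1, y2, f, g)  # recovers the inputs up to rounding
```

Because the inverse only subtracts what the forward pass added, the coupling step discards no information, which is the property that makes invertible blocks attractive for preserving fine detail.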

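The authors evaluate their method on infrared-visible and medical image fusion tasks and, per the abstract, report results competitive with state-of-the-art methods. On downstream tasks, the fused images score on average 9.78% higher mAP@.5 in object detection and 6.46% higher mIoU in semantic segmentation than those of other fusion methods.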
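
For readers unfamiliar with mIoU, the segmentation metric cited in the paper's abstract, the sketch below computes it for flattened label maps. The function name and the toy labels are illustrative assumptions, and benchmark toolkits typically accumulate intersections and unions over a whole dataset rather than a single image.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class in range(num_classes) appears in the inputs")
    return sum(ious) / len(ious)


# Hypothetical 4-pixel example with two classes.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
# class 0: 1/2, class 1: 2/3, so the mean is 7/12
```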

Critical Analysis

The paper presents a well-designed and comprehensive approach to infrared-visible image fusion. The use of multiple complementary techniques is a key strength, as it allows the network to leverage the advantages of each component.

However, one potential limitation is the computational complexity of the overall model, as the combination of several neural network modules may result in a slower inference time. The authors do not provide detailed information on the runtime performance of their method.

Additionally, the paper does not discuss the robustness of the approach to variations in input data, such as different imaging conditions or sensor characteristics. Further research could explore the generalization capabilities of the proposed network.

Conclusion

This paper introduces a semantic-aware and multi-guided network for fusing infrared and visible images. By integrating feature aggregation, auto-encoding, graph neural networks, and invertible neural networks, the method produces high-quality fused outputs that maintain semantic consistency.

The technical innovations and the strong empirical results presented in this work contribute to the ongoing effort to improve the quality and usefulness of infrared-visible image fusion, which has important applications in fields such as surveillance, remote sensing, and autonomous navigation.



Related Papers

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

Xiaoli Zhang, Liying Wang, Libo Zhao, Xiongfei Li, Siwei Ma

Multi-modality image fusion aims at fusing specific-modality and shared-modality information from two source images. To tackle the problem of insufficient feature extraction and lack of semantic awareness for complex scenes, this paper focuses on how to model correlation-driven decomposing features and reason high-level graph representation by efficiently extracting complementary features and multi-guided feature aggregation. We propose a three-branch encoder-decoder architecture along with corresponding fusion layers as the fusion strategy. The transformer with Multi-Dconv Transposed Attention and Local-enhanced Feed Forward network is used to extract shallow features after the depthwise convolution. In the three parallel branches encoder, Cross Attention and Invertible Block (CAI) enables to extract local features and preserve high-frequency texture details. Base feature extraction module (BFE) with residual connections can capture long-range dependency and enhance shared-modality expression capabilities. Graph Reasoning Module (GR) is introduced to reason high-level cross-modality relations and extract low-level details features as CAI's specific-modality complementary information simultaneously. Experiments demonstrate that our method has obtained competitive results compared with state-of-the-art methods in visible/infrared image fusion and medical image fusion tasks. Moreover, we surpass other fusion methods in terms of subsequent tasks, averagely scoring 9.78% mAP@.5 higher in object detection and 6.46% mIoU higher in semantic segmentation.

7/9/2024

IVGF: The Fusion-Guided Infrared and Visible General Framework

Fangcen Liu, Chenqiang Gao, Fang Chen, Pengcheng Li, Junjie Guo, Deyu Meng

Infrared and visible dual-modality tasks such as semantic segmentation and object detection can achieve robust performance even in extreme scenes by fusing complementary information. Most current methods design task-specific frameworks, which are limited in generalization across multiple tasks. In this paper, we propose a fusion-guided infrared and visible general framework, IVGF, which can be easily extended to many high-level vision tasks. Firstly, we adopt the SOTA infrared and visible foundation models to extract the general representations. Then, to enrich the semantics information of these general representations for high-level vision tasks, we design the feature enhancement module and token enhancement module for feature maps and tokens, respectively. Besides, the attention-guided fusion module is proposed for effectively fusing by exploring the complementary information of two modalities. Moreover, we also adopt the cutout&mix augmentation strategy to conduct the data augmentation, which further improves the ability of the model to mine the regional complementary between the two modalities. Extensive experiments show that the IVGF outperforms state-of-the-art dual-modality methods in the semantic segmentation and object detection tasks. The detailed ablation studies demonstrate the effectiveness of each module, and another experiment explores the anti-missing modality ability of the proposed method in the dual-modality semantic segmentation task.

9/17/2024

CrossFuse: A Novel Cross Attention Mechanism based Infrared and Visible Image Fusion Approach

Hui Li, Xiao-Jun Wu

Multimodal visual information fusion aims to integrate the multi-sensor data into a single image which contains more complementary information and less redundant features. However the complementary information is hard to extract, especially for infrared and visible images which contain big similarity gap between these two modalities. The common cross attention modules only consider the correlation, on the contrary, image fusion tasks need focus on complementarity (uncorrelation). Hence, in this paper, a novel cross attention mechanism (CAM) is proposed to enhance the complementary information. Furthermore, a two-stage training strategy based fusion scheme is presented to generate the fused images. For the first stage, two auto-encoder networks with same architecture are trained for each modality. Then, with the fixed encoders, the CAM and a decoder are trained in the second stage. With the trained CAM, features extracted from two modalities are integrated into one fused feature in which the complementary information is enhanced and the redundant features are reduced. Finally, the fused image can be generated by the trained decoder. The experimental results illustrate that our proposed fusion method obtains the SOTA fusion performance compared with the existing fusion networks. The codes are available at https://github.com/hli1221/CrossFuse

6/18/2024

HSFusion: A high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation

Chengjie Jiang, Xiaowen Liu, Bowen Zheng, Lu Bai, Jing Li

Infrared and visible image fusion has been developed from vision perception oriented fusion methods to strategies which both consider the vision perception and high-level vision task. However, the existing task-driven methods fail to address the domain gap between semantic and geometric representation. To overcome these issues, we propose a high-level vision task-driven infrared and visible image fusion network via semantic and geometric domain transformation, termed HSFusion. Specifically, to minimize the gap between semantic and geometric representation, we design two separate domain transformation branches by CycleGAN framework, and each includes two processes: the forward segmentation process and the reverse reconstruction process. CycleGAN is capable of learning domain transformation patterns, and the reconstruction process of CycleGAN is conducted under the constraint of these patterns. Thus, our method can significantly facilitate the integration of semantic and geometric information and further reduces the domain gap. In fusion stage, we integrate the infrared and visible features extracted from the reconstruction process of two separate CycleGANs to obtain the fused result. These features, containing varying proportions of semantic and geometric information, can significantly enhance the high level vision tasks. Additionally, we generate masks based on segmentation results to guide the fusion task. These masks can provide semantic priors, and we design adaptive weights for two distinct areas in the masks to facilitate image fusion. Finally, we conducted comparative experiments between our method and eleven other state-of-the-art methods, demonstrating that our approach surpasses others in both visual appeal and semantic segmentation task.

7/16/2024