HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion






Published 4/9/2024 by Jiahang Li, Peng Yun, Qijun Chen, Rui Fan


Data-fusion networks have shown significant promise for RGB-thermal scene parsing. However, the majority of existing studies have relied on symmetric duplex encoders for heterogeneous feature extraction and fusion, paying inadequate attention to the inherent differences between RGB and thermal modalities. Recent progress in vision foundation models (VFMs) trained through self-supervision on vast amounts of unlabeled data has proven their ability to extract informative, general-purpose features. However, this potential has yet to be fully leveraged in the domain. In this study, we take one step toward this new research area by exploring a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing. Specifically, we delve deeper into the unique characteristics of RGB and thermal modalities, thereby designing a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network. This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner. Moreover, we introduce an auxiliary task to further enrich the local semantics of the fused features, thereby improving the overall performance of RGB-thermal scene parsing. Our proposed HAPNet, equipped with all these components, demonstrates superior performance compared to all other state-of-the-art RGB-thermal scene parsing networks, achieving top ranks across three widely used public RGB-thermal scene parsing datasets. We believe this new paradigm has opened up new opportunities for future developments in data-fusion scene parsing approaches.


Overview

  • This paper introduces HAPNet, a new method for RGB-thermal scene parsing that uses a hybrid, asymmetric, and progressive approach to feature fusion.
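  • The key ideas are to extract features with a hybrid, asymmetric encoder that pairs a vision foundation model (VFM) with a convolutional neural network, to fuse the RGB and thermal features progressively along two paths, and to add an auxiliary task that enriches the local semantics of the fused features.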
  • The goal is to achieve superior performance in scene parsing tasks compared to existing methods.

Plain English Explanation

Scene parsing is the task of understanding the contents of an image by assigning a category label, such as road, car, or person, to every pixel. HAPNet aims to improve on this by using two types of visual data: regular color (RGB) images and thermal images that show heat patterns.

The researchers developed a new way to combine the information from these two data sources. Instead of just averaging or concatenating the features, they use a more sophisticated "hybrid, asymmetric, and progressive" fusion approach. This means the model learns to selectively and dynamically combine the RGB and thermal features, with the fusion happening in multiple stages.

The overall model architecture is also hybrid: its encoder combines a vision foundation model (VFM), pretrained through self-supervision on large amounts of unlabeled data, with a convolutional neural network. This allows the model to effectively leverage the complementary information in the RGB and thermal data.

The key innovation is this multi-stage, asymmetric feature fusion process. By adaptively fusing the RGB and thermal data, the model captures a richer understanding of the scene than previous approaches, most of which pass both modalities through symmetric duplex encoders and pay little attention to the inherent differences between RGB and thermal images.

Technical Explanation

HAPNet uses a <a href="https://aimodels.fyi/papers/arxiv/henet-hybrid-encoding-end-to-end-multi">hybrid encoder</a> to process both the RGB and thermal inputs. Rather than duplicating one backbone per modality, the encoder is asymmetric: it pairs a vision foundation model (VFM) with a convolutional neural network. The extracted features then feed into a <a href="https://aimodels.fyi/papers/arxiv/sne-roadsegv2-advancing-heterogeneous-feature-fusion-fallibility">heterogeneous feature fusion</a> module.
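To make the idea of an asymmetric encoder concrete, here is a minimal NumPy sketch, not the authors' implementation. It contrasts a transformer-style patch embedding with a CNN-style strided convolution and shows that two structurally different branches can still produce feature maps of the same shape, ready for fusion. The function names, the random weights, the sizes, and the choice of which input goes to which branch are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(image, patch, dim, rng):
    """Transformer-style branch: cut non-overlapping patches, project each linearly."""
    c, h, w = image.shape
    gh, gw = h // patch, w // patch
    patches = image[:, :gh * patch, :gw * patch].reshape(c, gh, patch, gw, patch)
    # Group the pixels of each patch into one flat vector per grid cell.
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(gh, gw, c * patch * patch)
    weight = rng.standard_normal((c * patch * patch, dim)) * 0.02  # random stand-in weights
    tokens = patches @ weight                # (gh, gw, dim)
    return tokens.transpose(2, 0, 1)         # (dim, gh, gw)

def conv_branch(image, stride, dim, rng):
    """CNN-style branch: one 3x3 strided convolution followed by ReLU."""
    c, h, w = image.shape
    padded = np.pad(image, ((0, 0), (1, 1), (1, 1)))
    weight = rng.standard_normal((dim, c, 3, 3)) * 0.1  # random stand-in weights
    out_h = (h + 2 - 3) // stride + 1
    out_w = (w + 2 - 3) // stride + 1
    out = np.zeros((dim, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = padded[:, i * stride:i * stride + 3, j * stride:j * stride + 3]
            out[:, i, j] = np.tensordot(weight, window, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0.0)

# Hypothetical inputs: a 3-channel RGB image and a 1-channel thermal image.
rgb = rng.random((3, 32, 32))
thermal = rng.random((1, 32, 32))

# Assumption for illustration only: RGB goes to the patch branch, thermal to the conv branch.
rgb_feat = patch_embed(rgb, patch=4, dim=16, rng=rng)
thermal_feat = conv_branch(thermal, stride=4, dim=16, rng=rng)
print(rgb_feat.shape, thermal_feat.shape)
```

In a real system the patch branch would be a pretrained foundation model rather than a single random projection; the sketch only shows the structural asymmetry.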

This fusion module performs an <a href="https://aimodels.fyi/papers/arxiv/mitigating-heterogeneity-federated-multimodal-learning-biomedical-vision">asymmetric and progressive</a> combination of the RGB and thermal features across multiple stages. This allows the model to selectively emphasize relevant features from each modality at different levels of the network.
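The staged, selective combination described above can be sketched as follows. This is an illustrative gated fusion written in NumPy, not HAPNet's actual fusion module: the gating rule, the helper names (`gated_fuse`, `progressive_fusion`), the random weights, and the constant channel width are all assumptions. The fusion is asymmetric because the thermal features are gated and added onto the RGB features rather than both being treated identically, and it is progressive because each stage also receives the fused result of the previous stage.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def downsample2x(x):
    """Average-pool a (C, H, W) feature map by a factor of 2."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def gated_fuse(primary, secondary, w_gate):
    """Asymmetric fusion: a gate computed from the secondary features
    decides how much of them is added to the primary features."""
    # A 1x1 convolution is a per-pixel matrix multiply over channels.
    gate = sigmoid(np.einsum("oc,chw->ohw", w_gate, secondary))
    return primary + gate * secondary

def progressive_fusion(rgb_feats, thermal_feats, gates):
    """Fuse stage by stage, carrying each fused map into the next stage."""
    outputs, fused_prev = [], None
    for rgb, thermal, w_gate in zip(rgb_feats, thermal_feats, gates):
        fused = gated_fuse(rgb, thermal, w_gate)
        if fused_prev is not None:
            fused = fused + downsample2x(fused_prev)
        outputs.append(fused)
        fused_prev = fused
    return outputs

# Hypothetical three-stage pyramid with 8 channels at every stage.
rng = np.random.default_rng(0)
sizes = [16, 8, 4]
rgb_feats = [rng.standard_normal((8, s, s)) for s in sizes]
thermal_feats = [rng.standard_normal((8, s, s)) for s in sizes]
gates = [rng.standard_normal((8, 8)) * 0.1 for _ in sizes]

fused = progressive_fusion(rgb_feats, thermal_feats, gates)
print([f.shape for f in fused])
```

One property of this particular gate is that a thermal branch contributing nothing (all zeros) leaves the first-stage RGB features unchanged, which is one simple way to let a model fall back on a single modality.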

The fused features are then passed through a decoder to produce the final scene parsing output, and an auxiliary task is used to enrich their local semantics. Experiments show that this <a href="https://aimodels.fyi/papers/arxiv/diffusion-hyperfeatures-searching-through-time-space-semantic">hybrid and progressive fusion</a> approach outperforms previous <a href="https://aimodels.fyi/papers/arxiv/causal-mode-multiplexer-novel-framework-unbiased-multispectral">heterogeneous feature fusion</a> methods on three widely used public RGB-thermal scene parsing datasets.
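Benchmark comparisons in scene parsing are commonly reported as mean intersection-over-union (mIoU). The summary does not say which metric the paper uses, so the following is only a sketch of how such a score is typically computed. The `ignore_index` value of 255 and the tiny label maps are assumptions for illustration.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that appear in the prediction or the ground truth."""
    valid = target != ignore_index
    # confusion[i, j] counts pixels of true class i that were predicted as class j.
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    confusion = np.bincount(idx, minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0  # skip classes absent from both maps
    return float((intersection[present] / union[present]).mean())

# Hypothetical 2x3 label maps with three classes; 255 marks an unlabeled pixel.
target = np.array([[0, 0, 1],
                   [1, 2, 255]])
pred = np.array([[0, 1, 1],
                 [1, 2, 2]])

miou = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {miou:.4f}")
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. The unlabeled pixel is excluded even though the prediction there is nonzero.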

Critical Analysis

The paper provides a thorough evaluation of HAPNet's performance on several datasets, demonstrating its superiority over existing approaches. However, the authors acknowledge that the model's effectiveness may be limited to specific environmental conditions where thermal data provides complementary information to RGB.

Additionally, the computational complexity of the multi-stage fusion process could be a concern for real-time or resource-constrained applications. Further research is needed to explore ways to improve the efficiency of the model without sacrificing its performance.
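As a rough way to reason about this cost, the sketch below counts multiply-accumulate operations (MACs) for one 1x1 convolution that fuses concatenated RGB and thermal features at each encoder stage. Every number in it is hypothetical: the input size, the strides, the channel widths, and the use of a 1x1 convolution as the fusion operator are assumptions, not figures from the paper.

```python
def fusion_macs(height, width, stride, channels):
    """MACs for a 1x1 conv mapping 2*channels (concatenated modalities) to channels."""
    h, w = height // stride, width // stride
    return h * w * (2 * channels) * channels

# Hypothetical 480x640 input with a four-stage feature pyramid.
stages = [(4, 64), (8, 128), (16, 256), (32, 512)]  # (stride, channels)

per_stage = [fusion_macs(480, 640, s, c) for s, c in stages]
total = sum(per_stage)

for (s, c), macs in zip(stages, per_stage):
    print(f"stride {s:>2}, {c:>3} channels: {macs / 1e6:.1f} MMACs")
print(f"total: {total / 1e6:.1f} MMACs")
```

With these particular settings each stage costs the same, because halving the resolution divides the pixel count by four while doubling the channels multiplies the per-pixel work by four. The fusion cost therefore grows linearly with the number of stages, which is the kind of trade-off a more efficient design would need to address.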

The paper also does not address potential biases or fairness issues that may arise from the heterogeneous data sources. As these are important considerations for real-world deployments, future work should investigate the model's robustness and fairness across diverse settings and populations.

Conclusion

HAPNet represents a promising advance in RGB-thermal scene parsing by introducing a novel hybrid, asymmetric, and progressive feature fusion approach. By effectively leveraging the complementary strengths of RGB and thermal data, the model can achieve superior performance compared to previous methods.

While the paper demonstrates the technical merits of HAPNet, further research is needed to address its limitations and ensure its broader applicability and safety. Nonetheless, this work contributes valuable insights to the ongoing efforts in multimodal vision and heterogeneous feature fusion, with potential impacts on a range of applications that rely on robust scene understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
