AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

Read original: arXiv:2405.12944 - Published 5/22/2024 by Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang


Overview

Multispectral pedestrian detection can improve performance in complex lighting conditions.
Existing "double-stream" networks use a separate feature extraction branch for each data modality, which nearly doubles inference time compared with single-stream networks.
This has hindered the use of multispectral detection in embedded devices for autonomous systems.
Knowledge distillation methods have been proposed to address this, but they focus only on fused features, ignoring the original modal features.
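The cost of the double-stream design can be illustrated with a rough count of multiply-accumulate operations (MACs). The sketch below is not taken from the paper: the toy backbone (a stack of 1x1 convolutions), the channel counts, the resolution, and the idea that a single-stream network takes the two modalities concatenated along the channel axis are all assumptions, and the cost of the fusion step is ignored. MACs are only a proxy for inference time.

```python
def backbone_macs(in_channels, channels, num_layers, height, width):
    """MACs for a hypothetical stack of 1x1 convolutions at fixed resolution."""
    pixels = height * width
    first_layer = in_channels * channels * pixels
    later_layers = (num_layers - 1) * channels * channels * pixels
    return first_layer + later_layers


# Assumed input layout: 3-channel visible image, 1-channel thermal image.
RGB_CHANNELS, THERMAL_CHANNELS = 3, 1
cfg = dict(channels=64, num_layers=10, height=128, width=160)

# Double-stream: one full backbone per modality.
double_stream = backbone_macs(RGB_CHANNELS, **cfg) + backbone_macs(THERMAL_CHANNELS, **cfg)

# Single-stream: one backbone over the channel-concatenated input.
single_stream = backbone_macs(RGB_CHANNELS + THERMAL_CHANNELS, **cfg)

ratio = double_stream / single_stream
print(f"double-stream / single-stream MACs: {ratio:.2f}")
```

Only the first layer of the single-stream network sees the wider input, so under these assumptions the double-stream backbone costs close to twice as much.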

Plain English Explanation

Multispectral pedestrian detection uses cameras that capture images in multiple wavelengths of light (e.g., visible and infrared) to identify people in difficult lighting. This has been shown to work better than relying on a single type of camera. However, the common "double-stream" networks used for this task run a separate feature extraction pipeline for each type of camera, which nearly doubles inference time compared with a single-stream network.

This longer processing time has made it difficult to deploy multispectral detection on the embedded devices used in autonomous systems such as self-driving cars or security cameras. To address this, researchers have developed "knowledge distillation" techniques, in which a simpler, faster student model learns from a more complex teacher model. But existing distillation methods focus only on the final fused features and do not make full use of the features extracted from each camera separately.

The paper introduces a new "Adaptive Modal Fusion Distillation (AMFD)" framework that can better leverage the individual camera data to train a fast, efficient student model. It also presents a new challenging multispectral dataset called SMOD to evaluate these models.

Technical Explanation

The paper proposes the "Adaptive Modal Fusion Distillation (AMFD)" framework to address the limitations of existing knowledge distillation methods for multispectral pedestrian detection. AMFD aims to fully utilize the original modal features from the teacher network, rather than just focusing on the fused features.
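A small numerical sketch can show why distilling only from fused features discards information. Everything here is illustrative and assumed rather than taken from the paper: the teacher is assumed to fuse by element-wise sum, the losses are plain mean squared errors, and the function names and equal modality weights are hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two feature maps of equal shape."""
    return float(np.mean((a - b) ** 2))


def fused_only_loss(student_feat, teacher_fused):
    """Distill from the teacher's fused feature only."""
    return mse(student_feat, teacher_fused)


def modal_loss(student_feat, teacher_rgb, teacher_thermal, w_rgb=0.5, w_thermal=0.5):
    """Distill from each teacher modality separately (assumed fixed weights)."""
    return w_rgb * mse(student_feat, teacher_rgb) + w_thermal * mse(student_feat, teacher_thermal)


shape = (2, 4, 4)  # (channels, height, width)
student = np.ones(shape)

# Two different teacher states that produce the same fused feature under sum fusion.
rgb_a, thermal_a = np.full(shape, 2.0), np.zeros(shape)
rgb_b, thermal_b = np.ones(shape), np.ones(shape)
fused_a, fused_b = rgb_a + thermal_a, rgb_b + thermal_b

print(fused_only_loss(student, fused_a), fused_only_loss(student, fused_b))
print(modal_loss(student, rgb_a, thermal_a), modal_loss(student, rgb_b, thermal_b))
```

The fused-only loss cannot tell the two teacher states apart, while the per-modality loss can. How the modalities should be weighted is the part that AMFD's MEA module is described as handling.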

The key component is the "Modal Extraction Alignment (MEA)" module, which derives learning weights for the student network by integrating focal and global attention mechanisms. This lets the student learn its own fusion strategy, independent of the teacher's, without needing an additional feature fusion module.
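The summary does not specify how MEA computes its attention or its learning weights, so the following is only a generic sketch of attention-weighted feature distillation, not the paper's module. The spatial and channel attention functions, the softmax temperature, and all names are assumptions borrowed from common attention-based distillation practice.

```python
import numpy as np


def spatial_attention(feat, temperature=0.5):
    """Softmax over positions of the mean absolute activation.

    feat: (C, H, W). Returns an (H, W) map whose entries sum to H * W.
    """
    _, h, w = feat.shape
    energy = np.abs(feat).mean(axis=0).reshape(-1) / temperature
    energy = energy - energy.max()  # numerical stability
    weights = np.exp(energy)
    weights = weights / weights.sum()
    return (h * w * weights).reshape(h, w)


def channel_attention(feat, temperature=0.5):
    """Softmax over channels of the globally pooled absolute activation.

    feat: (C, H, W). Returns a (C,) vector whose entries sum to C.
    """
    c = feat.shape[0]
    energy = np.abs(feat).mean(axis=(1, 2)) / temperature
    energy = energy - energy.max()
    weights = np.exp(energy)
    weights = weights / weights.sum()
    return c * weights


def weighted_modal_loss(student_feat, teacher_modal_feat):
    """Squared error weighted by the teacher feature's own attention maps."""
    s_att = spatial_attention(teacher_modal_feat)   # (H, W)
    c_att = channel_attention(teacher_modal_feat)   # (C,)
    weight = c_att[:, None, None] * s_att[None, :, :]
    return float(np.mean(weight * (student_feat - teacher_modal_feat) ** 2))


rng = np.random.default_rng(0)
teacher_thermal = rng.normal(size=(8, 16, 16))
student = teacher_thermal + 0.1 * rng.normal(size=(8, 16, 16))
print(weighted_modal_loss(student, teacher_thermal))
```

In this sketch, positions and channels where the teacher responds strongly contribute more to the loss, which is one simple way that attention could act as a learning weight for the student.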

The paper also introduces "SMOD", a new well-aligned and challenging multispectral dataset for detection. Extensive experiments on SMOD as well as the KAIST and LLVIP datasets show that AMFD outperforms existing state-of-the-art methods, both reducing the log-average miss rate and improving mean average precision.
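For readers unfamiliar with the log-average miss rate, the sketch below follows the definition commonly used in pedestrian detection benchmarks: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log space between 0.01 and 1, and the samples are combined with a geometric mean. The paper's exact evaluation settings are not given in this summary, so treat the details (number of points, FPPI range, handling of short curves) as assumptions.

```python
import numpy as np


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of miss rates sampled at log-spaced FPPI reference points.

    fppi and miss_rate describe one curve, ordered by increasing fppi.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)

    # Sentinel point: before any detection is accepted, every pedestrian is missed.
    fppi_ext = np.concatenate([[-1.0], fppi])
    mr_ext = np.concatenate([[1.0], miss_rate])

    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = np.empty(num_points)
    for i, ref in enumerate(refs):
        # Last curve point whose FPPI does not exceed the reference value.
        idx = np.where(fppi_ext <= ref)[0][-1]
        sampled[i] = mr_ext[idx]

    # Clamp to avoid log(0), then average in log space.
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))


# Hypothetical curve, for illustration only.
lamr = log_average_miss_rate([0.005, 0.05, 0.5], [0.9, 0.5, 0.1])
print(f"log-average miss rate: {lamr:.3f}")
```

Lower is better. Because the average is taken in log space, a detector is rewarded for keeping the miss rate low across the whole FPPI range rather than at a single operating point.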

Critical Analysis

The paper makes a compelling case for AMFD as a way to improve the efficiency of multispectral pedestrian detection systems. By drawing on the teacher's individual modal features rather than only its fused features, AMFD trains a student model that, in the reported experiments, outperforms existing state-of-the-art methods while avoiding the extra inference time of a double-stream network.

However, the paper does not extensively explore the limitations of the approach. For example, it is unclear how AMFD would scale to more than two modalities, beyond visible and infrared cameras. Additionally, the attention-derived learning weights, while effective, could make the distillation process harder to interpret than simpler distillation methods.

Further research could investigate ways to make the AMFD framework more generalized and robust, as well as explore techniques to maintain model interpretability. Insights from related work on causal mode multiplexing, interpretable multi-stage approaches, and efficient multimodal fusion could also be beneficial.

Conclusion

The Adaptive Modal Fusion Distillation (AMFD) framework presented in this paper offers a promising solution to the efficiency challenges of multispectral pedestrian detection. By better leveraging the individual modal features, AMFD can train fast, compact student models that maintain high performance. The introduction of the SMOD dataset also provides a valuable new benchmark for evaluating these types of multimodal detection systems.

While the paper has some limitations, the core ideas behind AMFD could have far-reaching impacts on the deployment of advanced computer vision techniques in real-world autonomous systems, where low latency and resource efficiency are critical. Further research building on this work could lead to even more efficient and robust multimodal detection models, paving the way for safer and more capable self-driving cars, surveillance systems, and other applications.



Related Papers


AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection

Zizhao Chen, Yeqiang Qian, Xiaoxiao Yang, Chunxiang Wang, Ming Yang

Multispectral pedestrian detection has been shown to be effective in improving performance within complex illumination scenarios. However, prevalent double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, leading to nearly double the inference time compared to single-stream networks utilizing only one feature extraction branch. This increased inference time has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems. To address this limitation, various knowledge distillation methods have been proposed. However, traditional distillation methods focus only on the fusion features and ignore the large amount of information in the original multi-modal features, thereby restricting the student network's performance. To tackle the challenge, we introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network. Specifically, a Modal Extraction Alignment (MEA) module is utilized to derive learning weights for student networks, integrating focal and global attention mechanisms. This methodology enables the student network to acquire optimal fusion strategies independent from that of teacher network without necessitating an additional feature fusion module. Furthermore, we present the SMOD dataset, a well-aligned challenging multispectral dataset for detection. Extensive experiments on the challenging KAIST, LLVIP and SMOD datasets are conducted to validate the effectiveness of AMFD. The results demonstrate that our method outperforms existing state-of-the-art methods in both reducing log-average Miss Rate and improving mean Average Precision. The code is available at https://github.com/bigD233/AMFD.git.

5/22/2024

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.

5/30/2024

E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection

Jiaqing Zhang, Mingxiang Cao, Xue Yang, Weiying Xie, Jie Lei, Daixun Li, Wenbo Huang, Yunsong Li

Multimodal image fusion and object detection are crucial for autonomous driving. While current methods have advanced the fusion of texture details and semantic information, their complex training processes hinder broader applications. Addressing this challenge, we introduce E2E-MFD, a novel end-to-end algorithm for multimodal fusion detection. E2E-MFD streamlines the process, achieving high performance with a single training phase. It employs synchronous joint optimization across components to avoid suboptimal solutions tied to individual tasks. Furthermore, it implements a comprehensive optimization strategy in the gradient matrix for shared parameters, ensuring convergence to an optimal fusion detection configuration. Our extensive testing on multiple public datasets reveals E2E-MFD's superior capabilities, showcasing not only visually appealing image fusion but also impressive detection outcomes, such as a 3.9% and 2.0% mAP50 increase on horizontal object detection dataset M3FD and oriented object detection dataset DroneVehicle, respectively, compared to state-of-the-art approaches. The code is released at https://github.com/icey-zhang/E2E-MFD.

5/24/2024

Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble

Juhan Cha, Minseok Joo, Jihwan Park, Sanghyeok Lee, Injae Kim, Hyunwoo J. Kim

Recent advancements in 3D object detection have benefited from multi-modal information from the multi-view cameras and LiDAR sensors. However, the inherent disparities between the modalities pose substantial challenges. We observe that existing multi-modal 3D object detection methods heavily rely on the LiDAR sensor, treating the camera as an auxiliary modality for augmenting semantic details. This often leads to not only underutilization of camera data but also significant performance degradation in scenarios where LiDAR data is unavailable. Additionally, existing fusion methods overlook the detrimental impact of sensor noise induced by environmental changes, on detection performance. In this paper, we propose MEFormer to address the LiDAR over-reliance problem by harnessing critical information for 3D object detection from every available modality while concurrently safeguarding against corrupted signals during the fusion process. Specifically, we introduce Modality Agnostic Decoding (MOAD) that extracts geometric and semantic features with a shared transformer decoder regardless of input modalities and provides promising improvement with a single modality as well as multi-modality. Additionally, our Proximity-based Modality Ensemble (PME) module adaptively utilizes the strengths of each modality depending on the environment while mitigating the effects of a noisy sensor. Our MEFormer achieves state-of-the-art performance of 73.9% NDS and 71.5% mAP in the nuScenes validation set. Extensive analyses validate that our MEFormer improves robustness against challenging conditions such as sensor malfunctions or environmental changes. The source code is available at https://github.com/hanchaa/MEFormer

8/20/2024