Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach

Read original: arXiv:2408.00969 - Published 8/6/2024 by Yabin Zhu, Qianwu Wang, Chenglong Li, Jin Tang, Zhixiang Huang

Overview

This paper introduces a new large-scale video dataset for visible-thermal multiple object tracking, along with a progressive fusion approach to address this task.
The dataset, called VT-MOT, includes 582 video sequence pairs (401k frame pairs) and 3.99 million annotated bounding boxes, captured from surveillance, drone, and handheld platforms.
The proposed progressive fusion approach leverages both visible and thermal modalities to achieve robust and accurate multi-object tracking.

Plain English Explanation

The researchers have developed a new dataset and method for visible-thermal multiple object tracking. The dataset, VT-MOT, contains 3.99 million annotated bounding boxes across 582 pairs of visible and thermal video sequences (401k frame pairs), recorded from surveillance, drone, and handheld platforms.

The key idea behind their progressive fusion approach is to combine information from both the visible (e.g., color camera) and thermal (e.g., infrared camera) modalities to achieve more robust and accurate multi-object tracking. By progressively fusing the data from these two sources, the method can better handle occlusions, changing lighting conditions, and other challenges that can arise in real-world tracking scenarios.

Technical Explanation

The authors introduce VT-MOT, a large-scale video benchmark for visible-thermal multiple object tracking, which contains 3.99 million annotated bounding boxes across 582 video sequence pairs (401k frame pairs). The sequences come from surveillance, drone, and handheld platforms, are aligned spatially and temporally frame by frame, and include challenging scenarios such as heavy occlusion and object re-acquisition (objects that disappear and reappear).

To address the task of visible-thermal multiple object tracking, the researchers propose a progressive fusion approach. The method starts by independently processing the visible and thermal modalities using separate tracking modules. It then progressively fuses the information from these two sources, leveraging their complementary strengths to improve tracking performance.
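As a rough illustration of what fusing two modalities can mean at the feature level, the sketch below blends a visible and a thermal feature vector with a single reliability weight. The function name `fuse_features`, the `visible_weight` parameter, and the weighted-average rule are all assumptions made for illustration; the paper's actual fusion module is not described in this summary and is likely learned rather than hand-set.

```python
from typing import List, Sequence


def fuse_features(
    visible: Sequence[float],
    thermal: Sequence[float],
    visible_weight: float,
) -> List[float]:
    """Blend two same-length feature vectors element by element.

    visible_weight is a hypothetical reliability score in [0, 1]:
    1.0 trusts only the visible features, 0.0 only the thermal ones.
    """
    if len(visible) != len(thermal):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= visible_weight <= 1.0:
        raise ValueError("visible_weight must lie in [0, 1]")
    w = visible_weight
    # Convex combination of the two modalities.
    return [w * v + (1.0 - w) * t for v, t in zip(visible, thermal)]


if __name__ == "__main__":
    # At night the visible stream is less reliable, so weight it lightly.
    print(fuse_features([1.0, 0.0], [0.0, 1.0], visible_weight=0.25))
```

In a low-light scene, a caller might lower `visible_weight` so that the thermal features dominate; in daylight the opposite choice would be natural.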

The progressive fusion process involves three main steps:

Initialization: The method initializes object tracks using both visible and thermal cues, ensuring robust initialization even in challenging conditions.
Association: The method associates detections across frames by considering appearance, motion, and thermal features, enabling accurate object-level association.
Update: The method updates the object tracks by dynamically integrating visible and thermal information, adapting to changes in the environment and object characteristics.

By combining the visible and thermal modalities in this progressive manner, the proposed approach can handle occlusions, illumination changes, and other difficulties that often arise in real-world tracking scenarios.

Critical Analysis

The authors have done a commendable job in creating a large-scale and diverse visible-thermal multiple object tracking dataset, which can be a valuable resource for the research community. The progressive fusion approach they propose also represents a promising step towards robust and accurate multi-object tracking in real-world conditions.

However, the abstract reports only that comprehensive experiments on VT-MOT show the method outperforming state-of-the-art methods, without naming the competing methods or metrics. A detailed comparative analysis against a wide range of visible-thermal trackers is needed to fully assess the strengths and limitations of the proposed approach.

Additionally, although the authors point to several potential future directions for visible-thermal MOT, it would be valuable to see a discussion of challenges or caveats associated with their approach, such as computational complexity, sensitivity to parameter tuning, or potential biases in the dataset. Exploring these aspects could help guide the development of more robust and generalizable visible-thermal tracking solutions.

Conclusion

This paper presents a significant contribution to the field of visible-thermal multiple object tracking. The introduction of a large-scale and diverse video dataset, coupled with the proposed progressive fusion approach, represents an important step forward in addressing the challenges of real-world tracking scenarios. While the technical details are well-explained, the inclusion of a more comprehensive evaluation and discussion of potential limitations could further strengthen the impact of this research. Overall, the work showcases the potential of leveraging complementary modalities to achieve robust and accurate multi-object tracking.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach

Yabin Zhu, Qianwu Wang, Chenglong Li, Jin Tang, Zhixiang Huang

The complementary benefits of visible and thermal infrared data are widely utilized in various computer vision tasks, such as visual tracking, semantic segmentation and object detection, but rarely explored in Multiple Object Tracking (MOT). In this work, we contribute a large-scale Visible-Thermal video benchmark for MOT, called VT-MOT. VT-MOT has the following main advantages. 1) The data is large-scale and highly diverse. VT-MOT includes 582 video sequence pairs and 401k frame pairs from surveillance, drone, and handheld platforms. 2) The cross-modal alignment is highly accurate. We invite several professionals to perform both spatial and temporal alignment frame by frame. 3) The annotation is dense and high-quality. VT-MOT has 3.99 million annotation boxes annotated and double-checked by professionals, including heavy occlusion and object re-acquisition (objects disappear and reappear) challenges. To provide a strong baseline, we design a simple yet effective tracking framework, which effectively fuses temporal information and complementary information of two modalities in a progressive manner, for robust visible-thermal MOT. Comprehensive experiments are conducted on VT-MOT, and the results demonstrate the superiority and effectiveness of the proposed method compared with state-of-the-art methods. From the evaluation results and analysis, we specify several potential future directions for visible-thermal MOT. The project is released at https://github.com/wqw123wqw/PFTrack.

8/6/2024

Awesome Multi-modal Object Tracking

Chunhui Zhang, Li Liu, Hao Wen, Xi Zhou, Yanfeng Wang

Multi-modal object tracking (MMOT) is an emerging field that combines data from various modalities, e.g., vision (RGB), depth, thermal infrared, event, language and audio, to estimate the state of an arbitrary object in a video sequence. It is of great significance for many applications such as autonomous driving and intelligent surveillance. In recent years, MMOT has received more and more attention. However, existing MMOT algorithms mainly focus on two modalities (e.g., RGB+depth, RGB+thermal infrared, and RGB+language). To leverage more modalities, some recent efforts have been made to learn a unified visual object tracking model for any modality. Additionally, some large-scale multi-modal tracking benchmarks have been established by simultaneously providing more than two modalities, such as vision-language-audio (e.g., WebUAV-3M) and vision-depth-language (e.g., UniMod1K). To track the latest progress in MMOT, we conduct a comprehensive investigation in this report. Specifically, we first divide existing MMOT tasks into five main categories, i.e., RGBL tracking, RGBE tracking, RGBD tracking, RGBT tracking, and miscellaneous (RGB+X), where X can be any modality, such as language, depth, and event. Then, we analyze and summarize each MMOT task, focusing on widely used datasets and mainstream tracking algorithms based on their technical paradigms (e.g., self-supervised learning, prompt learning, knowledge distillation, generative models, and state space models). Finally, we maintain a continuously updated paper list for MMOT at https://github.com/983632847/Awesome-Multimodal-Object-Tracking.

6/3/2024

Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines

Xinyi Ying, Chao Xiao, Ruojing Li, Xu He, Boyang Li, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Wei An, Weidong Sheng, Li Liu

Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.

6/21/2024

STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking

Jianbo Ma, Chuanming Tang, Fei Wu, Can Zhao, Jianlin Zhang, Zhiyong Xu

Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent frame cooperation, while the trajectory embedding is propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state of the art in MOTA and IDF1 metrics. The source codes are released at https://github.com/ydhcg-BoBo/STCMOT.

9/18/2024