Towards Long Term SLAM on Thermal Imagery

2403.19885

Published 4/1/2024 by Colin Keil, Aniket Gupta, Pushyami Kaveti, Hanumant Singh

Abstract

Visual SLAM with thermal imagery, and in other low-contrast, visually degraded environments such as underwater or in areas dominated by snow and ice, remains a difficult problem for many state-of-the-art (SOTA) algorithms. In addition to challenging front-end data association, thermal imagery presents an additional difficulty for long-term relocalization and map reuse. The relative temperatures of objects in thermal imagery change dramatically from day to night. Feature descriptors typically used for relocalization in SLAM are unable to maintain consistency over these diurnal changes. We show that learned feature descriptors can be used within existing Bag of Words based localization schemes to dramatically improve place recognition across large temporal gaps in thermal imagery. In order to demonstrate the effectiveness of our trained vocabulary, we have developed a baseline SLAM system, integrating learned features and matching into a classical SLAM algorithm. Our system demonstrates good local tracking on challenging thermal imagery, and relocalization that overcomes dramatic day-to-night thermal appearance changes. Our code and datasets are available here: https://github.com/neufieldrobotics/IRSLAM_Baseline

Introduction

The paper discusses the challenges of using visual Simultaneous Localization and Mapping (SLAM) systems, which rely on camera imagery, in environments with poor visibility or significant illumination changes, such as nighttime or adverse weather conditions. It proposes Long-Wave Infrared (LWIR) imagery, commonly known as thermal imaging, as a promising way to see in dark, dust-filled, or smoke-filled environments without external lighting.

However, the paper highlights that temperature-driven appearance changes in outdoor thermal imagery, even over a few hours, pose unique challenges for feature extraction and localization under varied environmental conditions. Existing feature-based methods are notably less effective on infrared (IR) imagery for two reasons. In the short term, fewer features are extracted and they are extracted inconsistently. In the long term, image gradients invert as the LWIR energy emitted by different objects varies.

The paper demonstrates that inconsistent feature extraction makes the ORB-based place recognition schemes used in most state-of-the-art (SOTA) visual SLAM systems ineffective over temporal gaps of only a few hours. Leading feature-based SLAM systems, such as ORB-SLAM3, encounter significant difficulties. In contrast, state-of-the-art flow-based methods like DROID-SLAM provide reasonable local tracking results but lack an easily exploitable mapping or place recognition model suited to relocalization within diurnal LWIR datasets. Other flow-based frameworks, including VINS-Fusion and Basalt, rely on BRIEF/ORB features for loop closure detection or relocalization and therefore falter on IR imagery.

Figure 1: Long Wave Infrared (thermal) imagery poses a significant challenge for place recognition due to dramatic appearance changes over the course of a day. At the top we show a pair of images taken with a static camera approximately 12 hours apart. At the bottom we show matches that are recoverable using the Gluestick feature matching pipeline.

The paper presents an approach to enable all-day autonomy for robotic systems using long-wave infrared (LWIR) cameras as the primary sensor. It revisits classical feature extraction techniques and highlights their limitations for LWIR imagery. The authors advocate for using Gluestick, a learning-based method, to extract and match features resistant to illumination changes.

The researchers integrate the learned feature descriptor within MCSLAM to assess its visual SLAM performance. They build a Bag of Words (BoW) vocabulary from SuperPoint features extracted from LWIR images captured at various times of day in urban, outdoor settings on the Northeastern University campus, using handheld or vehicle-mounted cameras.

To evaluate their method and compare against other SLAM systems, the authors collected comprehensive test datasets over 24-hour periods using static and mobile IR cameras with real-time kinematic (RTK) GPS ground truth. These datasets highlight the inadequacies of most existing data collections, particularly in capturing dynamic illumination conditions inherent to outdoor environments.

The experiments indicate that the Gluestick-augmented variant of MCSLAM can track features and achieve relocalization between day and night imagery.

The contributions are threefold: an extensive dataset collected with FLIR Boson thermal cameras; a BoW vocabulary built on SuperPoint features for effective loop closure and Visual Place Recognition (VPR) across day-night datasets; and a feature-based visual SLAM baseline that combines MCSLAM with SuperPoint features and the Gluestick matcher. The baseline demonstrates strong local tracking and the ability to save a map during the day and accurately relocalize at night.

Related Work

The paper discusses feature matching and SLAM (Simultaneous Localization and Mapping) techniques for thermal images. Key points:

Feature Matching in Thermal Images:

Classical feature extraction methods like ORB, SIFT, and SURF perform poorly on thermal images due to noise, low contrast, and flat field correction issues.
Learning-based features show improvement but struggle to adapt to thermal images.
Some methods were designed specifically for thermal imagery, but they fail to match images with dramatic intensity differences between day and night.
The paper proposes using Gluestick, a cross-attention based matching scheme robust to challenging conditions like lighting changes and low overlap.

Thermal SLAM:

Classical visual SLAM methods struggle with thermal images and challenging scenarios.
Recent works fuse thermal with visible, LiDAR, or IMU data to overcome limitations, but mostly focus on odometry.
Place recognition and loop closure are crucial for consistent long-term, multi-session mapping between day and night when images look very different.
Bag of Words (BoW) approaches have been instrumental in efficient loop closure detection.
The paper aims to explore day-to-night relocalization within a SLAM framework for thermal images.

The paper then describes the data collection setup for training, analysis, and benchmarking the proposed SLAM system.

Dataset

The paper describes the collection of a varied thermal imagery dataset for evaluating Day-Night relocalization and SLAM performance. Three main types of data were collected:

Test set of 24-hour outdoor timelapses with static cameras in semi-urban environments, captured every 10 minutes. These scenes are used for benchmarking methods where pixel-level accurate ground truth is useful.
A large set of monocular sequences with handheld and vehicle-mounted cameras, without ground truth trajectory information. These are used to augment the Bag-of-Words (BoW) training.
Three sets of matched day and night trajectories of varying sizes with stereo and side-facing cameras. These trajectories follow a pre-determined route during the day and night, allowing for place recognition evaluations between day-to-day, night-to-night, and day-to-night scenarios. Ground truth position data with RTK GPS is provided for benchmarking day-to-night and night-to-day loop closure.

The dataset was captured using FLIR Boson ADK thermal cameras with a resolution of 512x640 and a 75-degree horizontal field of view. For the paired day-night trajectories, a stereo pair with a 1.1m baseline was mounted on a Lincoln MKZ car, along with an RTK GPS antenna.
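To give a sense of what the stated camera geometry implies, the following sketch derives an approximate focal length in pixels from the 75-degree horizontal field of view and converts stereo disparity into depth using the 1.1 m baseline. It assumes an ideal, rectified pinhole stereo pair and that the 640-pixel dimension is the image width. The function names are illustrative and do not come from the paper's code.

```python
import math

def focal_length_px(image_width_px: float, hfov_deg: float) -> float:
    """Pinhole focal length in pixels from the horizontal field of view."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth of a point seen by a rectified stereo pair: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def depth_error_per_pixel(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """First-order depth change for a one-pixel disparity error: z^2 / (f * B)."""
    return depth_m ** 2 / (focal_px * baseline_m)

# Field of view and baseline are from the text; width = 640 px is an assumption.
f = focal_length_px(640, 75.0)
baseline = 1.1
for d in (40.0, 10.0, 2.0):
    z = depth_from_disparity(f, baseline, d)
    err = depth_error_per_pixel(f, baseline, z)
    print(f"disparity {d:5.1f} px -> depth {z:7.1f} m, "
          f"about {err:6.2f} m per pixel of disparity error")
```

Under this model the depth uncertainty grows with the square of the range, so points far from the vehicle are triangulated much less precisely than nearby ones.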

Traditional camera calibration targets were unsuitable for thermal cameras, so a wooden calibration target with a copper tape checkerboard pattern was used, allowing contrast when heated. Kalibr was employed to estimate intrinsic coefficients for all cameras and extrinsic parameters for the stereo pair.

Method

The paper discusses several techniques used for image preprocessing, feature extraction and matching, place recognition, and SLAM pipeline development for infrared (IR) imagery.

Image Preprocessing: Raw IR images have poor contrast, so Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to raise contrast and increase the number of extracted keypoints. The enhancement strength is a trade-off: stronger equalization yields more keypoints but also amplifies noise.
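The contrast-limiting idea behind CLAHE can be sketched on a single tile. This is a simplified illustration, not the paper's preprocessing code: it uses an absolute clip limit, redistributes the clipped counts once, and omits the interpolation between neighboring tiles that full CLAHE performs. The function name, the 16-level intensity range, and the toy tile are assumptions made for the example.

```python
def clipped_equalize(tile, clip_limit=4, levels=16):
    """Equalize one tile of integer pixels in [0, levels) with histogram clipping."""
    flat = [p for row in tile for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1

    # Clip each bin and collect the excess. This bounds how strongly any
    # single intensity (often noise in flat regions) can be stretched.
    excess = 0
    for i in range(levels):
        if hist[i] > clip_limit:
            excess += hist[i] - clip_limit
            hist[i] = clip_limit

    # Redistribute the clipped counts uniformly across all bins.
    share, remainder = divmod(excess, levels)
    for i in range(levels):
        hist[i] += share + (1 if i < remainder else 0)

    # Cumulative histogram -> intensity lookup table.
    total = len(flat)
    lut, running = [], 0
    for count in hist:
        running += count
        lut.append(round((levels - 1) * running / total))

    return [[lut[p] for p in row] for row in tile]

# Toy low-contrast tile: every pixel is 7, 8, or 9 on a 0..15 scale.
tile = [
    [7, 7, 8, 8],
    [7, 8, 8, 9],
    [8, 8, 9, 9],
    [8, 9, 9, 9],
]
enhanced = clipped_equalize(tile)
for row in enhanced:
    print(row)
```

A lower `clip_limit` flattens the mapping and limits noise amplification, while a higher one approaches plain histogram equalization, which is the trade-off described above.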

Feature Extraction and Matching: The Gluestick method, which uses SuperPoint features, performs significantly better than traditional methods like ORB and SIFT for matching day and night IR images. This is because learned features are more robust to noise and do not rely on gradient orientations which change with temperature. Gluestick matches lines in addition to points.
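A toy experiment illustrates why binary intensity-comparison descriptors are fragile under thermal appearance changes. The sketch below builds a BRIEF-style descriptor (the family ORB's descriptor belongs to) on a synthetic patch and on the same patch with inverted intensities. The full inversion is an idealization chosen for clarity; real day-to-night changes are partial and vary across the image. All names and sizes are hypothetical.

```python
import random

def brief_descriptor(patch, pairs):
    """One bit per pixel pair: 1 if the first pixel is brighter than the second."""
    return [1 if patch[r1][c1] > patch[r2][c2] else 0
            for (r1, c1), (r2, c2) in pairs]

def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)
size, n_bits = 8, 64

# Synthetic "day" patch with distinct intensities, so no comparison is a tie.
values = list(range(size * size))
rng.shuffle(values)
day = [values[r * size:(r + 1) * size] for r in range(size)]

# Hypothetical "night" patch: same structure, inverted relative intensities,
# mimicking objects that were warmer than their surroundings becoming cooler.
night = [[(size * size - 1) - v for v in row] for row in day]

# Random sampling pattern shared by both descriptors, as in BRIEF.
pairs = []
while len(pairs) < n_bits:
    a = (rng.randrange(size), rng.randrange(size))
    b = (rng.randrange(size), rng.randrange(size))
    if a != b:
        pairs.append((a, b))

d_day = brief_descriptor(day, pairs)
d_night = brief_descriptor(night, pairs)
print("Hamming distance:", hamming(d_day, d_night), "of", n_bits, "bits")
```

Because every comparison reverses when the intensities invert, the two descriptors disagree on every bit, so a nearest-neighbor search in Hamming space cannot associate the same physical point across the change.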

Place Recognition: An IR image vocabulary is built using DBoW2 extended for SuperPoint features matched across image pairs using Gluestick. Training data includes day and night image sequences to improve temporal generalization.
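The retrieval step of a Bag of Words scheme can be sketched as follows. DBoW2 uses a hierarchical vocabulary tree and an inverted index; this minimal version instead assumes a tiny flat vocabulary and hand-written 2-D descriptors so the arithmetic is easy to follow (SuperPoint descriptors are 256-dimensional). The scoring function is the L1 similarity between normalized TF-IDF vectors. Every name and value here is illustrative and not taken from the paper's code.

```python
import math
from collections import Counter

def nearest_word(desc, vocabulary):
    """Index of the closest visual word (squared Euclidean distance)."""
    return min(range(len(vocabulary)),
               key=lambda i: sum((d - w) ** 2 for d, w in zip(desc, vocabulary[i])))

def compute_idf(word_sets, n_words):
    """idf[w] = log(N / number of training images that contain word w)."""
    n = len(word_sets)
    idf = []
    for w in range(n_words):
        df = sum(1 for words in word_sets if w in words)
        idf.append(math.log(n / df) if df else 0.0)
    return idf

def bow_vector(descriptors, vocabulary, idf):
    """Sparse, L1-normalized TF-IDF vector for one image."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    weighted = {w: (c / len(descriptors)) * idf[w] for w, c in counts.items()}
    norm = sum(weighted.values())
    return {w: v / norm for w, v in weighted.items()} if norm > 0 else {}

def l1_score(v1, v2):
    """Similarity in [0, 1] between two L1-normalized sparse vectors."""
    if not v1 or not v2:
        return 0.0
    words = set(v1) | set(v2)
    return 1.0 - 0.5 * sum(abs(v1.get(w, 0.0) - v2.get(w, 0.0)) for w in words)

# Hypothetical 4-word vocabulary and toy descriptors.
vocab = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
db_descriptors = {
    "place_a_day": [(0.1, 0.0), (0.9, 0.1), (0.0, 0.1)],
    "place_b_day": [(0.1, 0.9), (1.0, 0.9), (0.9, 1.0)],
}
night_query = [(0.0, 0.2), (0.8, 0.0), (1.0, 0.1)]  # place A revisited at night

word_sets = [{nearest_word(d, vocab) for d in descs}
             for descs in db_descriptors.values()]
idf = compute_idf(word_sets, len(vocab))
database = {name: bow_vector(descs, vocab, idf)
            for name, descs in db_descriptors.items()}

query_vec = bow_vector(night_query, vocab, idf)
scores = {name: l1_score(query_vec, vec) for name, vec in database.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Retrieval only works if the night descriptors quantize to the same words as the day descriptors. That is why the choice of descriptor, and training the vocabulary on both day and night imagery, matter for this pipeline.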

SLAM Pipeline: The Multi Camera SLAM (MCSLAM) framework is adapted by replacing ORB features with SuperPoint descriptors and improving the matching process with Gluestick. This allows using stereo IR cameras and potentially extending to camera arrays. The DBoW2 loop closure mechanism is utilized with the IR SuperPoint vocabulary for relocalization.

Experiments

The paper demonstrates a place recognition system and its effectiveness for infrared (IR) imagery. The key points are:

Evaluation on Loop Closure:

Tested on the KRI dataset by splitting trajectories into two loops.
Built a database from the first loop and searched it with queries from the second loop.
Achieved 100% recall for day-day and night-night loop closures with their IR SuperPoint vocabulary, comparable to ORB features.
For significant day-night temporal gaps, their method showed 91-93% recall while ORB features performed poorly (10-12% recall).
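A recall figure of this kind can be computed roughly as sketched below. The summary does not state how a correct loop closure is defined, so this sketch assumes one definition: a query counts as a true revisit if any database frame lies within a distance threshold of it according to ground truth, and a retrieval is correct if the returned frame lies within that same threshold. The threshold value, frame identifiers, and positions are hypothetical.

```python
import math

def distance(p, q):
    """Planar distance in meters between two (east, north) positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def loop_closure_recall(queries, database, retrieve, radius_m):
    """Correct retrievals divided by the number of queries that truly revisit the map.

    queries, database: lists of (frame_id, (east_m, north_m)) with ground truth positions.
    retrieve: callable mapping a query frame_id to a database frame_id, or None.
    """
    db_pos = dict(database)
    revisits = correct = 0
    for frame_id, pos in queries:
        # Only queries that truly revisit a mapped place count toward recall.
        if not any(distance(pos, p) <= radius_m for p in db_pos.values()):
            continue
        revisits += 1
        match = retrieve(frame_id)
        if match is not None and distance(pos, db_pos[match]) <= radius_m:
            correct += 1
    return correct / revisits if revisits else float("nan")

# First loop (database) and second loop (queries), toy positions in meters.
database = [("d0", (0.0, 0.0)), ("d1", (20.0, 0.0)), ("d2", (40.0, 0.0))]
queries = [("n0", (1.0, 0.0)), ("n1", (21.0, 1.0)),
           ("n2", (39.0, 0.0)), ("n3", (200.0, 0.0))]

# Stand-in for the place recognition system's top-1 result per query.
top1 = {"n0": "d0", "n1": "d2", "n2": "d2", "n3": "d1"}

recall = loop_closure_recall(queries, database, top1.get, radius_m=5.0)
print(f"recall = {recall:.2f}")
```

In this toy case query `n3` is excluded because it never revisits the mapped loop, and `n1` is retrieved incorrectly, so two of the three valid revisits are recovered.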

SLAM Evaluation:

Tested their augmented MCSLAM system on IR trajectories for front-end tracking and relocalization across day-night gaps.
Front-end tracking generated reasonably accurate trajectories, though tracking was lost in low-texture regions.
For relocalization, they built a map from the day trajectory and relocalized the night trajectory against it, achieving less than 3 m error in most cases relative to GPS ground truth.
Larger relocalization errors occurred when visual features were mainly on distant objects.
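The error measure described above can be sketched as a per-pose distance to the GPS track. This assumes the relocalized positions and the ground truth are already expressed in the same local metric frame and paired by timestamp, steps the summary does not detail. The sample coordinates are invented for illustration.

```python
import math
import statistics

def relocalization_errors(estimated, ground_truth):
    """Planar distance in meters between each relocalized pose and its GPS fix."""
    return [math.hypot(e[0] - g[0], e[1] - g[1])
            for e, g in zip(estimated, ground_truth)]

def summarize(errors, threshold_m=3.0):
    """Median, maximum, and fraction of poses with error below the threshold."""
    return {
        "median_m": statistics.median(errors),
        "max_m": max(errors),
        "fraction_below": sum(e < threshold_m for e in errors) / len(errors),
    }

# Hypothetical night poses relocalized in the day map, and matching GPS fixes.
estimated = [(0.0, 0.0), (10.0, 1.0), (20.0, 0.0), (30.0, 6.0)]
gps = [(0.0, 1.0), (10.0, 0.0), (21.0, 0.0), (30.0, 0.0)]

errors = relocalization_errors(estimated, gps)
summary = summarize(errors)
print(errors, summary)
```

Reporting the fraction below a threshold alongside the maximum keeps occasional large errors, such as those on distant features, visible rather than averaged away.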

The paper demonstrates the effectiveness of its IR place recognition system, which enables loop closures and map reuse across significant day-night changes where traditional methods like ORB struggle on thermal imagery.

Conclusion and Future Work

The paper discusses the ability to achieve challenging loop closure and relocalization, enabling map reuse, within long-term infrared (IR) datasets. This is accomplished using a bag-of-words (BoW) system, making it suitable for relatively simple incorporation into existing simultaneous localization and mapping (SLAM) systems. The baseline SLAM system can generate maps using Gluestick for data association, outperforming feature-based SLAM systems that use binary descriptors.

The paper suggests future work to optimize the system for memory usage and speed, as Gluestick is not a perfect drop-in replacement for efficient matching schemes used in other SLAM methods. Improvements and optimizations could be made regarding matching across more than two frames and matching between cameras with known extrinsic parameters.

Additionally, the paper acknowledges that the datasets used are not comprehensive. Future plans include collecting larger, more diverse datasets that include other sensing modalities for comparison, enabling the retraining or fine-tuning of feature extraction and matching, building better vocabularies, and conducting more thorough analysis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

BundledSLAM: An Accurate Visual SLAM System Using Multiple Cameras

Han Song, Cong Liu, Huafeng Dai

Multi-camera SLAM systems offer a plethora of advantages, primarily stemming from their capacity to amalgamate information from a broader field of view, thereby resulting in heightened robustness and improved localization accuracy. In this research, we present a significant extension and refinement of the state-of-the-art stereo SLAM system, known as ORB-SLAM2, with the objective of attaining even higher precision. To accomplish this objective, we commence by mapping measurements from all cameras onto a virtual camera termed BundledFrame. This virtual camera is meticulously engineered to seamlessly adapt to multi-camera configurations, facilitating the effective fusion of data captured from multiple cameras. Additionally, we harness extrinsic parameters in the bundle adjustment (BA) process to achieve precise trajectory estimation. Furthermore, we conduct an extensive analysis of the role of bundle adjustment (BA) in the context of multi-camera scenarios, delving into its impact on tracking, local mapping, and global optimization. Our experimental evaluation entails comprehensive comparisons between ground truth data and the state-of-the-art SLAM system. To rigorously assess the system's performance, we utilize the EuRoC datasets. The consistent results of our evaluations demonstrate the superior accuracy of our system in comparison to existing approaches.

4/1/2024

cs.RO

SL-SLAM: A robust visual-inertial SLAM based deep feature extraction and matching

Zhang Xiao, Shuaixin Li

This paper explores how deep learning techniques can improve visual-based SLAM performance in challenging environments. By combining deep feature extraction and deep matching methods, we introduce a versatile hybrid visual SLAM system designed to enhance adaptability in challenging scenarios, such as low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. Our system supports multiple modes, including monocular, stereo, monocular-inertial, and stereo-inertial configurations. We also analyze how to combine visual SLAM with deep learning methods to inform other research. Through extensive experiments on both public datasets and self-sampled data, we demonstrate the superiority of the SL-SLAM system over traditional approaches. The experimental results show that SL-SLAM outperforms state-of-the-art SLAM algorithms in terms of localization accuracy and tracking robustness. For the benefit of the community, we make public the source code at https://github.com/zzzzxxxx111/SLslam.

6/5/2024

cs.RO

Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular, Stereo, and RGB-D Cameras

Huajian Huang, Longwei Li, Hui Cheng, Sai-Kit Yeung

The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However, existing methods, fully relying on implicit representations, are so resource-hungry that they cannot run on portable devices, which deviates from the original intention of SLAM. In this paper, we present Photo-SLAM, a novel SLAM framework with a hyper primitives map. Specifically, we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features, we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features, enhancing photorealistic mapping performance. The extensive experiments with monocular, stereo, and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping, e.g., PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover, the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin, showing the potential of robotics applications.

4/9/2024

cs.CV

Twofold Structured Features-Based Siamese Network for Infrared Target Tracking

Wei-Jie Yan, Yun-Kai Xu, Qian Chen, Xiao-Fang Kong, Guo-Hua Gu, A-Jun Shao, Min-Jie Wan

Nowadays, infrared target tracking has been a critical technology in the field of computer vision and has many applications, such as motion analysis, pedestrian surveillance, intelligent detection, and so forth. Unfortunately, due to the lack of color, texture and other detailed information, tracking drift often occurs when the tracker encounters infrared targets that vary in size or shape. To address this issue, we present a twofold structured features-based Siamese network for infrared target tracking. First of all, in order to improve the discriminative capacity for infrared targets, a novel feature fusion network is proposed to fuse both shallow spatial information and deep semantic information into the extracted features in a comprehensive manner. Then, a multi-template update module based on a template update mechanism is designed to effectively deal with interference from target appearance changes, which are prone to cause early tracking failures. Finally, both qualitative and quantitative experiments are carried out on the VOT-TIR 2016 dataset, which demonstrate that our method achieves a balance of promising tracking performance and real-time tracking speed against other state-of-the-art trackers.

6/28/2024

eess.IV