Joint Spatial-Temporal Modeling and Contrastive Learning for Self-supervised Heart Rate Measurement

arXiv:2406.04942

Published 6/10/2024 by Wei Qian, Qi Li, Kun Li, Xinke Wang, Xiao Sun, Meng Wang, Dan Guo


Abstract

This paper briefly introduces the solutions developed by our team, HFUT-VUT, for Track 1 of self-supervised heart rate measurement in the 3rd Vision-based Remote Physiological Signal Sensing (RePSS) Challenge hosted at IJCAI 2024. The goal is to develop a self-supervised learning algorithm for heart rate (HR) estimation using unlabeled facial videos. To tackle this task, we present two self-supervised HR estimation solutions that integrate spatial-temporal modeling and contrastive learning, respectively. Specifically, we first propose a non-end-to-end self-supervised HR measurement framework based on spatial-temporal modeling, which can effectively capture subtle rPPG clues and leverage the inherent bandwidth and periodicity characteristics of rPPG to constrain the model. Meanwhile, we employ an excellent end-to-end solution based on contrastive learning, aiming to generalize across different scenarios from complementary perspectives. Finally, we combine the strengths of the above solutions through an ensemble strategy to generate the final predictions, leading to a more accurate HR estimation. As a result, our solutions achieved a remarkable RMSE score of 8.85277 on the test dataset, securing 2nd place in Track 1 of the challenge.
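The ensemble step and the RMSE metric mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the two solutions are combined, so the weighted average below is an assumption, and the function names and all numbers are made up for illustration.

```python
import math


def ensemble_hr(preds_a, preds_b, weight_a=0.5):
    """Weighted average of per-video HR predictions (bpm) from two models.

    Assumption: a simple weighted mean; the paper's actual ensemble
    strategy is not described in the abstract.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(preds_a, preds_b)]


def rmse(predictions, targets):
    """Root-mean-square error between predicted and reference HR values (bpm)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical predictions from the two solutions, and hypothetical references.
spatial_temporal = [70.0, 90.0, 64.0]
contrastive = [78.0, 84.0, 66.0]
reference = [74.0, 86.0, 66.0]

combined = ensemble_hr(spatial_temporal, contrastive)
print("spatial-temporal RMSE:", rmse(spatial_temporal, reference))
print("contrastive RMSE:     ", rmse(contrastive, reference))
print("ensemble RMSE:        ", rmse(combined, reference))
```

In this toy example the two models err in opposite directions, so the average lands closer to the reference than either model alone. That is the usual motivation for an ensemble, not a guarantee that averaging always helps.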


Overview

This paper presents a self-supervised method for measuring heart rate from video data using a joint spatial-temporal modeling approach and contrastive learning.
The proposed framework learns robust heart rate estimation without relying on ground truth heart rate labels, which can be costly or difficult to obtain.
The method leverages both spatial and temporal information in video data to learn a heart rate prediction model in a self-supervised manner.

Plain English Explanation

The researchers have developed a new way to measure a person's heart rate using only video footage of their face, without requiring any additional sensors or equipment. This is particularly useful because obtaining accurate heart rate measurements can sometimes be challenging or expensive.

Their approach works by training an AI system to learn the patterns in the video that are associated with a person's heartbeat. It does this in a "self-supervised" way, meaning the system figures out these patterns on its own, without being explicitly told what a normal heart rate looks like.

The key innovation is that the system looks at both the spatial information (what the person's face looks like) and the temporal information (how the face changes over time) to piece together the subtle changes that happen with each heartbeat. This joint spatial-temporal modeling allows the system to learn a more robust and accurate heart rate prediction model.
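To make the spatial and temporal ingredients concrete, here is a deliberately simple Python sketch. It is not the paper's model: it averages the pixels of each frame (the spatial step), then searches for the frequency that carries the most power in the resulting trace (the temporal step). The 40 to 180 bpm search range, the function names, and the synthetic "video" are all assumptions.

```python
import math


def spatial_mean_trace(frames):
    """Collapse each frame (a 2D list of pixel intensities) to its mean value."""
    return [sum(sum(row) for row in f) / (len(f) * len(f[0])) for f in frames]


def estimate_hr_bpm(trace, fps, low_bpm=40, high_bpm=180):
    """Return the integer bpm whose frequency carries the most power in the trace."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]  # remove the constant skin-tone level
    best_bpm, best_power = None, -1.0
    for bpm in range(low_bpm, high_bpm + 1):
        freq = bpm / 60.0  # beats per minute -> Hz
        re = sum(x * math.cos(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


# Synthetic 10-second clip: 2x2 "frames" whose brightness pulses at 1.2 Hz (72 bpm).
fps = 30
frames = []
for i in range(300):
    pulse = 0.5 * math.sin(2 * math.pi * 1.2 * i / fps)
    frames.append([[100 + pulse, 101 + pulse], [102 + pulse, 103 + pulse]])

trace = spatial_mean_trace(frames)
print("estimated heart rate (bpm):", estimate_hr_bpm(trace, fps))
```

Real facial video is far noisier than this synthetic signal, which is why a learned model that weighs regions and time steps is used instead of a plain average.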

Related research in this area includes Chaos in Motion (robust remote heart rate measurement through skin tracking), SiNC+ (unsupervised learning of periodic signals for camera-based vitals), and Self-Supervised Learning for Interventional Image Analytics; all three are listed under Related Papers.

Technical Explanation

The proposed framework consists of two main components:

Spatial-Temporal Modeling: The system learns to capture both the spatial (facial appearance) and temporal (facial motion) information in the video data through a series of convolutional and recurrent neural network layers. This joint modeling allows the system to learn a more comprehensive representation of the physiological signals associated with the heartbeat.
Contrastive Learning: The system is trained in a self-supervised manner using a contrastive learning approach. It learns to predict whether two video frames belong to the same heartbeat cycle or not, without access to ground truth heart rate labels. This forces the system to discover the inherent patterns in the data that are indicative of the heartbeat.
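The two components above can be sketched as label-free training signals in Python. This is an illustrative sketch, not the authors' losses. The first function is a generic InfoNCE-style contrastive loss over embedding vectors. The second measures how much spectral power falls outside an assumed heart rate band, one possible reading of the bandwidth constraint mentioned in the abstract. The band limits (0.66 to 3.0 Hz), the temperature, and all names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor; small when the positive is the closest sample."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


def out_of_band_ratio(signal, fps, low_hz=0.66, high_hz=3.0):
    """Fraction of non-DC spectral power lying outside [low_hz, high_hz]."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    in_band = out_band = 0.0
    for k in range(1, n // 2 + 1):  # naive DFT over the positive frequencies
        freq = k * fps / n
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if low_hz <= freq <= high_hz:
            in_band += power
        else:
            out_band += power
    total = in_band + out_band
    return out_band / total if total > 0 else 0.0


# Hypothetical embeddings: the positive is near the anchor, the negatives are not.
anchor, same_pulse = [1.0, 0.0], [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]
good = info_nce(anchor, same_pulse, others)
bad = info_nce(anchor, others[0], [same_pulse, others[1]])
print("loss with correct positive:", good, "| with wrong positive:", bad)

# Synthetic 5-second signals at 30 fps: 1.2 Hz is in band, 6 Hz is not.
fps = 30
pulse_like = [math.sin(2 * math.pi * 1.2 * i / fps) for i in range(150)]
too_fast = [math.sin(2 * math.pi * 6.0 * i / fps) for i in range(150)]
print("out-of-band ratio:", out_of_band_ratio(pulse_like, fps), out_of_band_ratio(too_fast, fps))
```

Minimizing either quantity requires no heart rate labels: the contrastive loss needs only a rule for choosing positives and negatives, and the band penalty needs only the prior that plausible heart rates occupy a limited frequency range.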

[The paper builds on related technical approaches, such as the work on whole-heart-3dt-representation-learning-through-sparse and kid-ppg-knowledge-informed-deep-learning-extracting.]

Through extensive experiments on multiple datasets, the authors demonstrate that their self-supervised approach can achieve heart rate estimation accuracy competitive with supervised methods, without requiring any ground truth labels.

Critical Analysis

The paper presents a novel and promising approach for self-supervised heart rate estimation from video data. The key strengths of the work include the joint spatial-temporal modeling and the contrastive learning framework, which allow the system to learn robust heart rate prediction without relying on ground truth labels.

However, the paper does not address some potential limitations of the approach. For example, the system may struggle with video data collected in challenging real-world conditions, such as with significant head movement or varying lighting. Additionally, the paper does not discuss the computational and memory requirements of the proposed framework, which could be an important practical consideration.

Further research could explore ways to improve the robustness of the method, such as by incorporating additional modalities (e.g., audio) or adapting the framework to handle a wider range of real-world scenarios. Additionally, it would be interesting to see how the self-supervised approach compares to fully supervised methods in terms of generalization and transfer learning capabilities.

Conclusion

Overall, this paper presents an innovative self-supervised approach for heart rate estimation from video data, which has the potential to significantly improve the accessibility and practicality of remote health monitoring applications. The joint spatial-temporal modeling and contrastive learning framework represent an important step forward in the field of self-supervised physiological signal processing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Chaos in Motion: Unveiling Robustness in Remote Heart Rate Measurement through Brain-Inspired Skin Tracking

Jie Wang, Jing Lian, Minjie Ma, Junqiang Lei, Chunbiao Li, Bin Li, Jizhao Liu

Heart rate is an important physiological indicator of human health status. Existing remote heart rate measurement methods typically involve facial detection followed by signal extraction from the region of interest (ROI). These SOTA methods have three serious problems: (a) inaccuracies or even failures in detection caused by environmental influences or subject movement; (b) failures for special patients such as infants and burn victims; (c) privacy leakage issues resulting from collecting face video. To address these issues, we regard the remote heart rate measurement as the process of analyzing the spatiotemporal characteristics of the optical flow signal in the video. We apply chaos theory to computer vision tasks for the first time, thus designing a brain-inspired framework. First, we use an artificial primary visual cortex model to extract the skin in the videos, and then calculate heart rate by time-frequency analysis on all pixels. Our method achieves Robust Skin Tracking for Heart Rate measurement, called HR-RST. The experimental results show that HR-RST overcomes the difficulty of environmental influences and effectively tracks the subject movement. Moreover, the method could extend to other body parts. Consequently, the method can be applied to special patients and effectively protect individual privacy, offering an innovative solution.

4/12/2024

cs.CV

SiNC+: Adaptive Camera-Based Vitals with Unsupervised Learning of Periodic Signals

Jeremy Speth, Nathan Vance, Patrick Flynn, Adam Czajka

Subtle periodic signals, such as blood volume pulse and respiration, can be extracted from RGB video, enabling noncontact health monitoring at low cost. Advancements in remote pulse estimation -- or remote photoplethysmography (rPPG) -- are currently driven by deep learning solutions. However, modern approaches are trained and evaluated on benchmark datasets with ground truth from contact-PPG sensors. We present the first non-contrastive unsupervised learning framework for signal regression to mitigate the need for labelled video data. With minimal assumptions of periodicity and finite bandwidth, our approach discovers the blood volume pulse directly from unlabelled videos. We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning visual features of periodic signals. We perform the first experiments utilizing unlabelled video data not specifically created for rPPG to train robust pulse rate estimators. Given the limited inductive biases, we successfully applied the same approach to camera-based respiration by changing the bandlimits of the target signal. This shows that the approach is general enough for unsupervised learning of bandlimited quasi-periodic signals from different domains. Furthermore, we show that the framework is effective for finetuning models on unlabelled video from a single subject, allowing for personalized and adaptive signal regressors.

4/23/2024

cs.CV cs.AI cs.LG

Self-Supervised Learning for Interventional Image Analytics: Towards Robust Device Trackers

Saahil Islam, Venkatesh N. Murthy, Dominik Neumann, Badhan Kumar Das, Puneet Sharma, Andreas Maier, Dorin Comaniciu, Florin C. Ghesu

An accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, with no failures during tracking. To achieve that, one needs to efficiently tackle challenges, such as: device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as the continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame interpolation based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and in particular robustness compared to ultra optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used); achieving a success score of 97.95% at a 3x faster inference speed of 42 frames-per-second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.

5/3/2024

cs.CV cs.AI

Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition

Weichao Zhao, Wengang Zhou, Hezhen Hu, Min Wang, Houqiang Li

Recently, there have been efforts to improve the performance in sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency from two distinct perspectives and learn instance discriminative representation for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize both granularity information and encode them into latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representation of instance samples. On the other hand, inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of both modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin. The source code is publicly available at https://github.com/sakura/Code.

6/18/2024

cs.CV