Multimodal Active Measurement for Human Mesh Recovery in Close Proximity

2310.08116

Published 5/13/2024 by Takahiro Maeda, Keisuke Takeshita, Kazuhito Tanaka

🚀

Abstract

For physical human-robot interactions (pHRI), a robot needs to estimate the accurate body pose of a target person. However, in these pHRI scenarios, the robot cannot fully observe the target person's body with equipped cameras because the target person must be close to the robot for physical interaction. This closeness leads to severe truncation and occlusions and thus results in poor accuracy of human pose estimation. For better accuracy in this challenging environment, we propose an active measurement and sensor fusion framework of the equipped cameras with touch and ranging sensors such as 2D LiDAR. Touch and ranging sensor measurements are sparse, but reliable and informative cues for localizing human body parts. In our active measurement process, camera viewpoints and sensor placements are dynamically optimized to measure body parts with higher estimation uncertainty, which is closely related to truncation or occlusion. In our sensor fusion process, assuming that the measurements of touch and ranging sensors are more reliable than the camera-based estimations, we fuse the sensor measurements to the camera-based estimated pose by aligning the estimated pose towards the measured points. Our proposed method outperformed previous methods on the standard occlusion benchmark with simulated active measurement. Furthermore, our method reliably estimated human poses using a real robot even with practical constraints such as occlusion by blankets.

Create account to get full access

Overview

Robots need to accurately estimate the body pose of a person for physical human-robot interaction (pHRI)
However, when the person is close to the robot, the robot's cameras can't fully observe the person's body due to truncation and occlusions
This leads to poor accuracy in human pose estimation
The researchers propose an active measurement and sensor fusion framework to address this challenge

Plain English Explanation

When robots and humans interact physically, the robots need to know the exact position and pose of the human's body. But when the human is very close to the robot, the robot's cameras can't see the whole body - parts of it get cut off or hidden behind other objects. This makes it really hard for the robot to figure out the human's body pose accurately.

To solve this problem, the researchers came up with a new approach. They use not just the robot's cameras, but also other sensors like touch sensors and laser rangefinders. These additional sensors can detect parts of the human's body that the cameras can't see. The researchers also actively adjust the position of the cameras and sensors to get the best measurements possible, focusing on the parts of the body that are hard to see.

Then, they combine all the sensor measurements to get a more complete and accurate estimate of the human's body pose. They trust the touch and rangefinder sensors more than the camera estimates, since those sensors work better when the human is very close to the robot.

This approach led to better results compared to previous methods, especially in situations with a lot of occlusion and truncation. It also worked well when tested on a real robot, even when the human's body was partially covered by a blanket.

Technical Explanation

The researchers propose an active measurement and sensor fusion framework to address the challenge of accurate human pose estimation in physical human-robot interaction (pHRI) scenarios. In these scenarios, the robot's cameras cannot fully observe the target person's body due to severe truncation and occlusions when the person is in close proximity to the robot.

The key elements of their approach are:

Active Measurement: The researchers dynamically optimize the camera viewpoints and sensor placements to measure body parts with higher estimation uncertainty, which is closely related to truncation or occlusion. This active measurement process aims to get the most informative data from the available sensors.
Sensor Fusion: The researchers assume that the measurements from touch and ranging sensors (like 2D LiDAR) are more reliable than the camera-based pose estimations in this occluded environment. They fuse the sensor measurements by aligning the camera-based estimated pose towards the measured points from the touch and ranging sensors.

The researchers evaluated their proposed method on a standard occlusion benchmark using simulated active measurement, and found it outperformed previous approaches. They also demonstrated that their method can reliably estimate human poses using a real robot, even in practical scenarios with occlusions from blankets.

Critical Analysis

The researchers acknowledge that their active measurement approach relies on having a set of pre-defined regions of interest corresponding to the human body parts. This may limit the flexibility of the system to handle unexpected occlusions or truncations. Additionally, the sensor fusion process assumes that the touch and ranging sensor measurements are consistently more accurate than the camera-based estimates, which may not always be the case in real-world scenarios.

While the results on the simulated benchmark and the real-robot experiments are promising, further research is needed to assess the robustness and generalizability of the approach. Factors such as variable lighting conditions, sensor failures, and complex interactions may pose additional challenges that were not addressed in this study.

It would also be valuable to explore ways to incorporate machine learning techniques, such as those used in Hybrid 3D Human Pose Estimation from Monocular Video or 3D Human Scan from a Moving Event Camera, to further improve the robustness and accuracy of the human pose estimation in occluded pHRI scenarios.

Conclusion

The researchers have proposed an active measurement and sensor fusion framework to address the challenge of accurate human pose estimation in physical human-robot interaction scenarios. By leveraging touch and ranging sensors in addition to cameras, and actively optimizing sensor placements, their approach can produce more reliable estimates of the human's body pose even when the person is in close proximity to the robot and parts of their body are occluded or truncated.

While the results are promising, further research is needed to improve the flexibility and robustness of the system, potentially by incorporating more advanced machine learning techniques, as seen in Ultra-Inertial Poser, Real-Time Simulated Avatar from Head-Mounted, and Leveraging Digital Perceptual Technologies for Remote Perception and Analysis. Overall, this research represents an important step towards enabling more seamless and safe physical interactions between robots and humans.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

New!Multimodal Visual-haptic pose estimation in the presence of transient occlusion

Michael Zechmair, Yannick Morel

Human-robot collaboration requires the establishment of methods to guarantee the safety of participating operators. A necessary part of this process is ensuring reliable human pose estimation. Established vision-based modalities encounter problems when under conditions of occlusion. This article describes the combination of two perception modalities for pose estimation in environments containing such transient occlusion. We first introduce a vision-based pose estimation method, based on a deep Predictive Coding (PC) model featuring robustness to partial occlusion. Next, capacitive sensing hardware capable of detecting various objects is introduced. The sensor is compact enough to be mounted on the exterior of any given robotic system. The technology is particularly well-suited to detection of capacitive material, such as living tissue. Pose estimation from the two individual sensing modalities is combined using a modified Luenberger observer model. We demonstrate that the results offer better performance than either sensor alone. The efficacy of the system is demonstrated on an environment containing a robot arm and a human, showing the ability to estimate the pose of a human forearm under varying levels of occlusion.

6/28/2024

cs.RO

Hybrid 3D Human Pose Estimation with Monocular Video and Sparse IMUs

Yiming Bao, Xu Zhao, Dahong Qian

Temporal 3D human pose estimation from monocular videos is a challenging task in human-centered computer vision due to the depth ambiguity of 2D-to-3D lifting. To improve accuracy and address occlusion issues, inertial sensor has been introduced to provide complementary source of information. However, it remains challenging to integrate heterogeneous sensor data for producing physically rational 3D human poses. In this paper, we propose a novel framework, Real-time Optimization and Fusion (RTOF), to address this issue. We first incorporate sparse inertial orientations into a parametric human skeleton to refine 3D poses in kinematics. The poses are then optimized by energy functions built on both visual and inertial observations to reduce the temporal jitters. Our framework outputs smooth and biomechanically plausible human motion. Comprehensive experiments with ablation studies demonstrate its rationality and efficiency. On Total Capture dataset, the pose estimation error is significantly decreased compared to the baseline method.

4/30/2024

cs.CV cs.HC

❗

3D Human Scan With A Moving Event Camera

Kai Kohyama, Shintaro Shiba, Yoshimitsu Aoki

Capturing a 3D human body is one of the important tasks in computer vision with a wide range of applications such as virtual reality and sports analysis. However, conventional frame cameras are limited by their temporal resolution and dynamic range, which imposes constraints in real-world application setups. Event cameras have the advantages of high temporal resolution and high dynamic range (HDR), but the development of event-based methods is necessary to handle data with different characteristics. This paper proposes a novel event-based method for 3D pose estimation and human mesh recovery. Prior work on event-based human mesh recovery require frames (images) as well as event data. The proposed method solely relies on events; it carves 3D voxels by moving the event camera around a stationary body, reconstructs the human pose and mesh by attenuated rays, and fit statistical body models, preserving high-frequency details. The experimental results show that the proposed method outperforms conventional frame-based methods in the estimation accuracy of both pose and body mesh. We also demonstrate results in challenging situations where a conventional camera has motion blur. This is the first to demonstrate event-only human mesh recovery, and we hope that it is the first step toward achieving robust and accurate 3D human body scanning from vision sensors. https://florpeng.github.io/event-based-human-scan/

4/17/2024

cs.CV

A Robust Filter for Marker-less Multi-person Tracking in Human-Robot Interaction Scenarios

Enrico Martini, Harshil Parekh, Shaoting Peng, Nicola Bombieri, Nadia Figueroa

Pursuing natural and marker-less human-robot interaction (HRI) has been a long-standing robotics research focus, driven by the vision of seamless collaboration without physical markers. Marker-less approaches promise an improved user experience, but state-of-the-art struggles with the challenges posed by intrinsic errors in human pose estimation (HPE) and depth cameras. These errors can lead to issues such as robot jittering, which can significantly impact the trust users have in collaborative systems. We propose a filtering pipeline that refines incomplete 3D human poses from an HPE backbone and a single RGB-D camera to address these challenges, solving for occlusions that can degrade the interaction. Experimental results show that using the proposed filter leads to more consistent and noise-free motion representation, reducing unexpected robot movements and enabling smoother interaction.

6/5/2024

cs.RO cs.AI cs.HC