Semantics-aware Motion Retargeting with Vision-Language Models

Read original: arXiv:2312.01964 - Published 4/16/2024 by Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong

Overview

This paper presents Semantics-aware Motion reTargeting (SMT), a method that uses a vision-language model to preserve motion semantics during motion retargeting.
The method transfers motion between animation characters while keeping the high-level meaning of the original movement, not only its joint-level details.
It does so by rendering the 3D motions to images, feeding those images to the vision-language model, and aligning the semantic embeddings the model extracts.

Plain English Explanation

The researchers developed a new way to take a motion performed by one animated character and transfer it to another character. The key innovation is that their method tries to preserve the meaning of the original movement, not just the physical motion itself.

To do this, they use a vision-language model, a model that can relate images to the concepts they depict. The system renders the 3D motion into images, obtains from the model a semantic representation of what those images show, and aligns the representations so that the retargeted motion carries the same meaning. The resulting animation should then not only look plausible, but also express the same meaning as the original movement.

This is an important advancement because it allows virtual characters to move in a way that is more natural and aligned with their context, rather than just mimicking motions without regard for their significance. It could enable more expressive and semantically coherent animations in video games, movies, and other interactive applications.

Technical Explanation

The core of the approach is a framework that brings a vision-language model into the motion retargeting process. According to the authors, most previous methods either neglect semantic information or rely on human-designed joint-level representations.

A differentiable module renders the 3D motions into images. The rendered images are fed to the vision-language model, and the semantic embeddings it extracts are aligned, which incorporates high-level motion semantics into retargeting. Because the rendering step is differentiable, gradients of the semantic objective can flow back through the rendered images to the retargeting model.

To preserve both fine-grained motion details and high-level semantics, training follows a two-stage pipeline: skeleton-aware pre-training, followed by fine-tuning with semantics and geometry constraints.
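
The paper's abstract describes aligning semantic embeddings extracted from rendered images and fine-tuning with semantics and geometry constraints. The sketch below illustrates one plausible way such an objective could be written. It rests on several assumptions that are not taken from the paper: cosine distance as the alignment measure, one embedding per rendered frame, and the hypothetical weights `w_sem` and `w_geo`. The embeddings are plain Python lists standing in for the output of a vision-language model, and the geometry penalty is passed in as a number rather than computed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def semantic_alignment_loss(source_embs, target_embs):
    """Mean cosine distance over paired per-frame embeddings.

    Assumption: source_embs[i] and target_embs[i] are embeddings of the
    i-th rendered frame of the source and retargeted motion.
    """
    if not source_embs or len(source_embs) != len(target_embs):
        raise ValueError("need the same non-zero number of frames")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(source_embs, target_embs)
    ]
    return sum(distances) / len(distances)


def fine_tuning_loss(semantic, geometry, w_sem=1.0, w_geo=1.0):
    """Weighted sum of a semantic term and a geometry penalty.

    The weights are hypothetical; the source does not state how the
    constraints are combined.
    """
    return w_sem * semantic + w_geo * geometry


# Toy data: frame 0 is perfectly aligned, frame 1 is orthogonal.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[2.0, 0.0], [1.0, 0.0]]
sem = semantic_alignment_loss(source, target)
print(sem)  # 0.5
print(fine_tuning_loss(sem, geometry=0.2))
```

Cosine distance ignores embedding magnitude, which is why the first frame pair counts as perfectly aligned even though the vectors differ in length. In a real system the embeddings would come from the vision-language model applied to differentiably rendered frames, so that the loss could be backpropagated to the retargeting model.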

The authors report experimental results showing that the method produces high-quality motion retargeting results while accurately preserving motion semantics.

Critical Analysis

A key strength of this work is its ability to go beyond joint-level motion imitation and instead capture the higher-level meaning behind a movement. By grounding retargeting in the semantic knowledge of a vision-language model, the system can produce retargeted motions that stay faithful to what the source motion conveys.

However, this summary cannot confirm how extensively the paper discusses limitations or failure cases. For example, it is unclear how sensitive the extracted semantics are to the rendering viewpoint, or how the method handles poses whose meaning is ambiguous in a rendered image.

Additionally, the experiments focus on a relatively narrow set of motion types and scenarios. Further research would be needed to assess the generalizability of the approach to a wider range of human activities and applications.

Another area for potential improvement is how the semantics are extracted. The current framework reaches the vision-language model through rendered images, so the semantic signal depends on what those renderings capture. Exploring representations that tie motion and semantics together more directly may lead to further performance gains.

Conclusion

This paper presents an innovative approach for semantics-aware motion retargeting that leverages the capabilities of vision-language models. By preserving the underlying meaning and intent behind human movements, the method can generate virtual character animations that are not only realistic, but also semantically coherent and contextually appropriate.

This work has implications for a variety of applications, most directly animating virtual characters in games and films. As vision-language models continue to advance, integrating them with motion data in this manner could lead to significant improvements in the realism and expressiveness of computer-generated movement.

While the current framework shows promise, further research is needed to address potential limitations and expand the scope of the approach. Nonetheless, this paper represents an important step forward in the field of motion retargeting and the broader challenge of imbuing artificial systems with a more human-like understanding of movement and its meaning.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Semantics-aware Motion Retargeting with Vision-Language Models

Haodong Zhang, ZhiKe Chen, Haocheng Xu, Lei Hao, Xiaofei Wu, Songcen Xu, Zhensong Zhang, Yue Wang, Rong Xiong

Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However, most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here, we present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics.

4/16/2024

Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion

Manuel Kansy, Jacek Naruniec, Christopher Schroers, Markus Gross, Romann M. Weber

Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion. Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions. Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.

8/2/2024

Spatio-Temporal Motion Retargeting for Quadruped Robots

Taerim Yoon, Dongho Kang, Seungmin Kim, Minsung Ahn, Jin Cheng, Stelian Coros, Sungjoon Choi

This work introduces a motion retargeting approach for legged robots, which aims to create motion controllers that imitate the fine behavior of animals. Our approach, namely spatio-temporal motion retargeting (STMR), guides imitation learning procedures by transferring motion from source to target, effectively bridging the morphological disparities by ensuring the feasibility of imitation on the target system. Our STMR method comprises two components: spatial motion retargeting (SMR) and temporal motion retargeting (TMR). On the one hand, SMR tackles motion retargeting at the kinematic level by generating kinematically feasible whole-body motions from keypoint trajectories. On the other hand, TMR aims to retarget motion at the dynamic level by optimizing motion in the temporal domain. We showcase the effectiveness of our method in facilitating Imitation Learning (IL) for complex animal movements through a series of simulation and hardware experiments. In these experiments, our STMR method successfully tailored complex animal motions from various media, including video captured by a hand-held camera, to fit the morphology and physical properties of the target robots. This enabled RL policy training for precise motion tracking, while baseline methods struggled with highly dynamic motion involving flying phases. Moreover, we validated that the control policy can successfully imitate six different motions in two quadruped robots with different dimensions and physical properties in real-world settings.

9/24/2024

Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis

Zeyi Zhang, Tenglong Ao, Yuyao Zhang, Qingzhe Gao, Chuan Lin, Baoquan Chen, Libin Liu

In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics. To address this challenge, we develop a generative retrieval framework based on a large language model. This framework efficiently retrieves suitable semantic gesture candidates from a motion library in response to the input speech. To construct this motion library, we summarize a comprehensive list of commonly used semantic gestures based on findings in linguistics, and we collect a high-quality motion dataset encompassing both body and hand movements. We also design a novel GPT-based model with strong generalization capabilities to audio, capable of generating high-quality gestures that match the rhythm of speech. Furthermore, we propose a semantic alignment mechanism to efficiently align the retrieved semantic gestures with the GPT's output, ensuring the naturalness of the final animation. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit, as evidenced by a comprehensive collection of examples. User studies confirm the quality and human-likeness of our results, and show that our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.

5/20/2024