CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Read original: arXiv:2403.13900 - Published 9/20/2024 by Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu

Overview

This paper presents CoMo, a method for generating and editing human motion in a controllable way through language-guided pose code editing.
The key idea is to represent a motion as discrete, semantically meaningful pose codes and let a large language model (LLM) edit those codes, which gives fine-grained control over the resulting motion.
The method is demonstrated on various tasks, including motion editing, motion interpolation, and text-driven motion generation.

Plain English Explanation

The paper introduces a new technique called CoMo that allows you to control the motion of a virtual character by using language. The basic idea is to take an existing motion, like a person walking or dancing, and then use a language model to tweak and edit the underlying "pose codes" that define the motion. This gives you the ability to make very specific changes to the motion just by typing in a description.

For example, you could take a walking motion and then use language to make the person walk more slowly, or to make them walk with a limp. Or you could take a dancing motion and use language to make the person dance more energetically or with more elegance. The language model helps translate your textual instructions into the right changes to the underlying pose codes that control the motion.

This language-guided pose code editing approach provides a lot of fine-grained control over the generated motion, allowing you to customize it in powerful ways. The researchers demonstrate this technique on a variety of tasks, like editing existing motions, blending and interpolating between different motions, and even generating brand new motions just from text descriptions.

Overall, the CoMo method represents an exciting advance in the field of motion synthesis and motion editing, giving creators and animators a new tool to bring virtual characters to life through the expressive power of language.

Technical Explanation

The key technical innovation of CoMo is to let a language model guide the editing of discrete, interpretable pose codes that describe a human motion.

The system first decomposes a motion sequence into discrete, semantically meaningful pose codes. Each code captures the state of one body part, for example "left knee slightly bent". For generation, the model takes a text input and autoregressively produces a sequence of pose codes. For editing, because the codes are interpretable, an LLM can adjust them directly according to an editing instruction, drawing on its knowledge priors.
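To make the idea of per-body-part pose codes concrete, here is a minimal sketch of how a single frame might be stored as a multi-hot vector over a small codebook. The codebook contents, the names `CODEBOOK`, `encode_frame`, and `decode_frame`, and the one-active-code-per-part layout are all assumptions for illustration; the paper's actual codebook and encoder are not reproduced here.

```python
# Hypothetical toy codebook: each body part has a few mutually exclusive states.
# The real CoMo codebook is defined in the paper and is not reproduced here.
CODEBOOK = {
    "left_knee": ["straight", "slightly bent", "fully bent"],
    "right_knee": ["straight", "slightly bent", "fully bent"],
    "left_arm": ["down", "raised"],
}

# Assign a flat index to every (body part, state) pair.
CODE_INDEX = {
    key: i
    for i, key in enumerate(
        (part, state) for part, states in CODEBOOK.items() for state in states
    )
}


def encode_frame(frame):
    """Turn {body part: state} into a multi-hot vector, one active code per part."""
    vec = [0] * len(CODE_INDEX)
    for part, state in frame.items():
        vec[CODE_INDEX[(part, state)]] = 1
    return vec


def decode_frame(vec):
    """Recover the readable {body part: state} description from a multi-hot vector."""
    inverse = {i: key for key, i in CODE_INDEX.items()}
    return {inverse[i][0]: inverse[i][1] for i, bit in enumerate(vec) if bit}


frame = {"left_knee": "slightly bent", "right_knee": "straight", "left_arm": "down"}
vec = encode_frame(frame)
print(vec)
print(decode_frame(vec))
```

Because every index maps back to a readable phrase, the encoded frame can be shown to a language model or a person without any extra decoding step, which is the property the editing stage relies on.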

During inference, the user provides a text prompt describing the desired changes, which the language model translates into instructions for how to edit the pose codes. The edited pose codes are then fed into a motion decoder network to generate the final, edited motion sequence.
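The inference flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `stub_llm` is a keyword lookup standing in for a real LLM call, the frame format and the `start`/`end` frame window are hypothetical, and the final decoding of pose codes into 3D motion is left out.

```python
from typing import Callable, Dict, List

Frame = Dict[str, str]  # body part -> pose code state (hypothetical format)
Edit = Dict[str, str]   # body part -> new state proposed by the language model


def stub_llm(instruction: str) -> Edit:
    """Stand-in for an LLM call. A keyword lookup, NOT how CoMo prompts its LLM."""
    rules = {
        "bend the left knee": {"left_knee": "fully bent"},
        "raise the left arm": {"left_arm": "raised"},
    }
    edit: Edit = {}
    for phrase, change in rules.items():
        if phrase in instruction.lower():
            edit.update(change)
    return edit


def edit_motion(
    motion: List[Frame],
    instruction: str,
    start: int,
    end: int,
    propose_edit: Callable[[str], Edit] = stub_llm,
) -> List[Frame]:
    """Apply the proposed code changes to frames in [start, end); copy the rest."""
    edit = propose_edit(instruction)
    edited = []
    for t, frame in enumerate(motion):
        new_frame = dict(frame)  # never mutate the source motion
        if start <= t < end:
            new_frame.update(edit)
        edited.append(new_frame)
    return edited


motion = [{"left_knee": "straight", "left_arm": "down"} for _ in range(4)]
edited = edit_motion(motion, "Raise the left arm", start=1, end=3)
for t, frame in enumerate(edited):
    print(t, frame)
```

In a full system the edited code sequence would then be passed to the motion decoder; swapping `stub_llm` for a real model call only requires a function with the same signature.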

The researchers demonstrate the capabilities of CoMo on a variety of tasks, including:

Motion Editing: Editing existing motions to match a language description (e.g. "walk more slowly")
Motion Interpolation: Blending between different motions using language-guided interpolation
Text-driven Motion Generation: Generating new motion sequences from scratch based on a text prompt
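For the text-driven generation task, the abstract describes generating pose code sequences autoregressively from text. The loop below is a toy sketch of that pattern only: `toy_next_code_probs` is a hand-written, hypothetical stand-in for a learned text-conditioned model, and the three whole-body codes are invented for the example.

```python
import random
from typing import Dict, List


def toy_next_code_probs(prompt: str, history: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a learned model: P(next code | text, previous codes)."""
    if "walk" in prompt.lower():
        if not history or history[-1] != "step_left":
            return {"step_left": 0.9, "stand": 0.1}
        return {"step_right": 0.9, "stand": 0.1}
    return {"stand": 1.0}


def generate(prompt: str, length: int, greedy: bool = True, seed: int = 0) -> List[str]:
    """Build a code sequence one step at a time, conditioning on what came before."""
    rng = random.Random(seed)
    history: List[str] = []
    for _ in range(length):
        probs = toy_next_code_probs(prompt, history)
        if greedy:
            code = max(probs, key=probs.get)  # most likely next code
        else:
            codes, weights = zip(*probs.items())
            code = rng.choices(codes, weights=weights, k=1)[0]
        history.append(code)
    return history


print(generate("a person walks forward", 4))
print(generate("a person stands still", 3))
```

Each step sees the prompt and all previously emitted codes, which is what distinguishes autoregressive generation from predicting every frame independently.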

According to the paper's experiments, CoMo achieves motion generation performance competitive with state-of-the-art models, and in human studies it substantially surpasses previous work on motion editing. This language-based control is an advance over earlier motion generation and editing techniques that relied more on manual, low-level control.

Critical Analysis

One key limitation of the CoMo method is that it relies on having access to a large dataset of high-quality motion capture data in order to train the motion encoder and decoder networks. This motion data can be expensive and time-consuming to collect.

The paper also doesn't fully address how well the language model generalizes to novel motion types or styles that are not well represented in the training data. There may be some brittleness or biases in the language model's understanding of motion that could limit its broader applicability.

Additionally, the paper does not provide a detailed analysis of the computational costs and latency of the CoMo system, which could be an important practical consideration for real-time applications like video games or animation.

Overall, while CoMo represents an exciting advance in language-guided motion control, there are still some open challenges and areas for further research to improve the robustness, generalization, and efficiency of this approach.

Conclusion

The CoMo method presented in this paper offers a novel way to generate controllable human motion by leveraging the expressive power of natural language. By using a language model to guide the editing of latent pose codes, the system allows for fine-grained control over the generated motions, enabling a wide range of applications in animation, gaming, and virtual reality.

While the current implementation has some limitations, the core idea of language-guided motion control is a significant step forward in the field of motion synthesis and editing. Further research to address the technical challenges and expand the capabilities of this approach could unlock new possibilities for how we create and interact with virtual characters and environments.

Related Papers

CoMo: Controllable Motion Generation through Language Guided Pose Code Editing

Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, Lingjie Liu

Text-to-motion models excel at efficient human motion generation, but existing approaches lack fine-grained controllability over the generation process. Consequently, modifying subtle postures within a motion or inserting new actions at specific moments remains a challenge, limiting the applicability of these methods in diverse scenarios. In light of these challenges, we introduce CoMo, a Controllable Motion generation model, adept at accurately generating and editing motions by leveraging the knowledge priors of large language models (LLMs). Specifically, CoMo decomposes motions into discrete and semantically meaningful pose codes, with each code encapsulating the semantics of a body part, representing elementary information such as left knee slightly bent. Given textual inputs, CoMo autoregressively generates sequences of pose codes, which are then decoded into 3D motions. Leveraging pose codes as interpretable representations, an LLM can directly intervene in motion editing by adjusting the pose codes according to editing instructions. Experiments demonstrate that CoMo achieves competitive performance in motion generation compared to state-of-the-art models while, in human studies, CoMo substantially surpasses previous work in motion editing abilities.

9/20/2024

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation

Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, Arash Vahdat

Recently video diffusion models have emerged as expressive generative tools for high-quality video content creation readily available to general users. However, these models often do not offer precise control over camera poses for video generation, limiting the expression of cinematic language and user control. To address this issue, we introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation. We equip a pre-trained image-to-video generator with accurately parameterized camera pose input using Plücker coordinates. To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block that enforces epipolar constraints to the feature maps. Additionally, we fine-tune CamCo on real-world videos with camera poses estimated through structure-from-motion algorithms to better synthesize object motion. Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models while effectively generating plausible object motion. Project page: https://ir1d.github.io/CamCo/

6/5/2024

ViMo: Generating Motions from Casual Videos

Liangdong Qiu, Chengxing Yu, Yanran Li, Zhao Wang, Haibin Huang, Chongyang Ma, Di Zhang, Pengfei Wan, Xiaoguang Han

Although humans have the innate ability to imagine multiple possible actions from videos, it remains an extraordinary challenge for computers due to the intricate camera movements and montages. Most existing motion generation methods predominantly rely on manually collected motion datasets, usually tediously sourced from motion capture (Mocap) systems or Multi-View cameras, unavoidably resulting in a limited size that severely undermines their generalizability. Inspired by recent advances in diffusion models, we probe a simple and effective way to capture motions from videos and propose a novel Video-to-Motion-Generation framework (ViMo) which could leverage the immense trove of untapped video content to produce abundant and diverse 3D human motions. Distinct from prior work, our videos could be more casual, including complicated camera movements and occlusions. Striking experimental results demonstrate the proposed model could generate natural motions even for videos where rapid movements, varying perspectives, or frequent occlusions might exist. We also show this work could enable three important downstream applications, such as generating dancing motions according to arbitrary music and source video style. Extensive experimental results prove that our model offers an effective and scalable way to generate diverse and realistic motions. Code and demos will be public soon.

8/14/2024

Iterative Motion Editing with Natural Language

Purvi Goel, Kuan-Chieh Wang, C. Karen Liu, Kayvon Fatahalian

Text-to-motion diffusion models can generate realistic animations from text prompts, but do not support fine-grained motion editing controls. In this paper, we present a method for using natural language to iteratively specify local edits to existing character animations, a task that is common in most computer animation workflows. Our key idea is to represent a space of motion edits using a set of kinematic motion editing operators (MEOs) whose effects on the source motion are well-aligned with user expectations. We provide an algorithm that leverages pre-existing language models to translate textual descriptions of motion edits into source code for programs that define and execute sequences of MEOs on a source animation. We execute MEOs by first translating them into keyframe constraints, and then use diffusion-based motion models to generate output motions that respect these constraints. Through a user study and quantitative evaluation, we demonstrate that our system can perform motion edits that respect the animator's editing intent, remain faithful to the original animation (it edits the original animation, but does not dramatically change it), and yield realistic character animation results.

6/4/2024