Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Read original: arXiv:2406.08096 - Published 6/18/2024 by Runyi Yu, Tianyu He, Ailing Zhang, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian

Overview

• This paper presents MyTalk, a method that edits the lip movements in a talking video to match given speech while preserving the speaker's identity and visual details, by disentangling motion from appearance.

• The approach generates motion and appearance in two stages: a speech-to-motion diffusion model produces the lip motion, and a motion-conditioned appearance generation model renders the frames.

• The model is trained on a large-scale, diverse dataset, and the authors report that it generalizes to unseen, even out-of-domain, people.

Plain English Explanation

This research aims to create realistic talking faces whose lip movements match audio input. The key innovation is that the model separates the motion of the lips from the overall appearance and identity of the actor. This means the system can generate lifelike lip movements for people it has not seen during training, while preserving their unique facial features and visual details.

Previous talking face generation methods have struggled to achieve this level of generalization and realism. By training on a diverse dataset, this new approach can produce high-quality lip-sync for a wide variety of actors, scenes, and speaking styles. This could have applications in virtual assistants, animated films, and teleconferencing, among others.

Technical Explanation

The paper presents MyTalk, a two-stage method for speech-driven lip sync that disentangles motion and appearance. Given a talking video and a speech clip, it edits the lip movements in the video to match the speech while preserving the person's identity and visual details.

Key aspects of the approach include:

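• Two-stage generation: A speech-to-motion diffusion model first generates lip motion from speech, and a motion-conditioned appearance generation model then synthesizes the frames. According to the authors, earlier methods that handle both sub-problems in a single generative model face a trade-off between lip-sync quality and preservation of visual details.
• Landmark-based motion: Motion is represented with facial landmarks, and a landmark-based identity loss helps preserve personal identity.
• Separate encoders with learned fusion: To capture motion-agnostic visual details, separate encoders encode the lip appearance, the non-lip appearance, and the motion, and a learned fusion module integrates them.
• Training on diverse data: The model is trained on a large-scale, diverse dataset of talking face videos.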
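
To make the diffusion-based motion stage more concrete, here is a minimal sketch of the standard DDPM forward (noising) process applied to a sequence of lip landmarks, which is the kind of motion representation the paper's abstract describes. It is not the authors' implementation: the linear noise schedule, the number of steps, and the landmark tensor shape (25 frames, 20 landmarks, 2 coordinates) are all assumptions chosen for illustration. A speech-to-motion diffusion model would be trained to undo this noising, conditioned on speech.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, common in DDPM)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta)

# Hypothetical clean motion: 25 frames x 20 lip landmarks x (x, y) coordinates.
landmarks = rng.uniform(-1.0, 1.0, size=(25, 20, 2))

# Noise the whole sequence at an intermediate timestep.
noisy, eps = forward_noise(landmarks, t=500, alpha_bar=alpha_bar, rng=rng)
```

During training, a denoising network would receive `noisy`, the timestep, and speech features, and would typically be asked to predict `eps`. That network and its speech conditioning are not shown here because the summary does not specify them.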

This disentanglement of motion and appearance, combined with the generalization enabled by diverse training data, allows the model to produce high-quality, realistic lip-sync for a variety of actors and scenarios.
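
The disentanglement idea can be sketched as separate encoders whose outputs are merged by a fusion step. The paper's abstract mentions separate encoders for lip appearance, non-lip appearance, and motion, integrated with a learned fusion module, but gives no architectural details in this summary. Everything below is therefore a hypothetical stand-in: the random linear layers, the feature size, the crop shapes, and concatenation followed by a linear projection as the fusion are assumptions, not the authors' design.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_linear(in_dim, out_dim):
    """Random linear layer standing in for a trained encoder (assumption)."""
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

def encode(x, weight):
    """Flatten the input, project it, and apply a ReLU."""
    return np.maximum(x.reshape(-1) @ weight, 0.0)

# Hypothetical sizes, chosen only to keep the example small.
LIP_SHAPE = (16, 16, 3)    # cropped lip region
FACE_SHAPE = (32, 32, 3)   # face with the lip region masked out
NUM_LANDMARKS = 20         # 2D landmarks describing the target motion
FEAT = 64

w_lip = make_linear(int(np.prod(LIP_SHAPE)), FEAT)
w_face = make_linear(int(np.prod(FACE_SHAPE)), FEAT)
w_motion = make_linear(NUM_LANDMARKS * 2, FEAT)
w_fuse = make_linear(3 * FEAT, FEAT)

def fuse(lip_crop, nonlip_face, landmarks):
    """Encode each input separately, then merge the three feature vectors."""
    feats = np.concatenate([
        encode(lip_crop, w_lip),        # lip appearance
        encode(nonlip_face, w_face),    # non-lip appearance
        encode(landmarks, w_motion),    # motion
    ])
    return feats @ w_fuse

lip = rng.uniform(size=LIP_SHAPE)
face = rng.uniform(size=FACE_SHAPE)
motion_a = rng.uniform(-1.0, 1.0, size=(NUM_LANDMARKS, 2))
motion_b = rng.uniform(-1.0, 1.0, size=(NUM_LANDMARKS, 2))

# Same appearance inputs, two different target motions.
fused_a = fuse(lip, face, motion_a)
fused_b = fuse(lip, face, motion_b)
```

Because appearance and motion enter through separate encoders, the appearance features stay identical when only the landmarks change. This is the property that lets one reference appearance be driven by different motions.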

Critical Analysis

The authors note that while their model performs well on a wide range of talking faces, it may struggle with extreme head poses or occlusions that are not well-represented in the training data. Additionally, the current model does not explicitly model 3D facial structure, which could further improve realism and generalization.

Future work could explore incorporating explicit 3D modeling, as well as handling more diverse audio inputs, such as emotional speech or cross-lingual lip-sync. Nonetheless, this paper represents an important step forward in the field of high-fidelity, generalizable talking face generation.

Conclusion

This research presents a novel approach for generating realistic, lip-synced talking faces that can be applied to a wide range of actors and scenarios. By disentangling motion and appearance, the model is able to produce lifelike facial animations that match input audio while preserving the unique identity and expressions of the actor.

The ability to create such high-quality, generalizable talking faces has the potential to enable new applications in areas like virtual assistants, animated films, and teleconferencing. While the current model has some limitations, this work represents an important advance in the field of facial animation and could inspire future research towards even more realistic and versatile talking face generation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Runyi Yu, Tianyu He, Ailing Zhang, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian

We aim to edit the lip movements in talking video according to the given speech while preserving the personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual details preservation. Instead, we propose to disentangle the motion and appearance, and then generate them one by one with a speech-to-motion diffusion model and a motion-conditioned appearance generation model. However, there still remain challenges in each stage, such as motion-aware identity preservation in (1) and visual details preservation in (2). Therefore, to preserve personal identity, we adopt landmarks to represent the motion, and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders to encode the lip, non-lip appearance and motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to the unknown, even out-of-domain person, in terms of both lip sync and visual detail preservation. We encourage the readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).

6/18/2024

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Weizhi Zhong, Junfan Lin, Peixin Chen, Liang Lin, Guanbin Li

Audio-driven talking face video generation has attracted increasing attention due to its huge industrial potential. Some previous methods focus on learning a direct mapping from audio to visual content. Despite progress, they often struggle with the ambiguity of the mapping process, leading to flawed results. An alternative strategy involves facial structural representations (e.g., facial landmarks) as intermediaries. This multi-stage approach better preserves the appearance details but suffers from error accumulation due to the independent optimization of different stages. Moreover, most previous methods rely on generative adversarial networks, prone to training instability and mode collapse. To address these challenges, our study proposes a novel landmark-based diffusion model for talking face generation, which leverages facial landmarks as intermediate representations while enabling end-to-end optimization. Specifically, we first establish the less ambiguous mapping from audio to landmark motion of lip and jaw. Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks via differentiable cross-attention, which enables end-to-end optimization for improved lip synchronization. Besides, TalkFormer employs implicit feature warping to align the reference image features with the target motion for preserving more appearance details. Extensive experiments demonstrate that our approach can synthesize high-fidelity and lip-synced talking face videos, preserving more subject appearance details from the reference image.

8/13/2024

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Barmann, Hazim Kemal Ekenel, Alexander Waibel

Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality, using given audio and reference video while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods. These involve unstable training, lip synchronization, and visual quality issues caused by lip-sync loss, SyncNet, and lip leaking from the identity reference. To address these issues, we first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage. We then introduce stabilized synchronization loss and AVSyncNet to overcome problems caused by lip-sync loss and SyncNet. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions and their cohesive effects.

7/19/2024

PersonaTalk: Bring Attention to Your Persona in Visual Dubbing

Longhao Zhang, Shuang Liang, Zhipeng Ge, Tianshu Hu

For audio-driven visual dubbing, it remains a considerable challenge to uphold and highlight speaker's persona while synthesizing accurate lip synchronization. Existing methods fall short of capturing speaker's unique speaking style or preserving facial details. In this paper, we present PersonaTalk, an attention-based two-stage framework, including geometry construction and face rendering, for high-fidelity and personalized visual dubbing. In the first stage, we propose a style-aware audio encoding module that injects speaking style into audio features through a cross-attention layer. The stylized audio features are then used to drive speaker's template geometry to obtain lip-synced geometries. In the second stage, a dual-attention face renderer is introduced to render textures for the target geometries. It consists of two parallel cross-attention layers, namely Lip-Attention and Face-Attention, which respectively sample textures from different reference frames to render the entire face. With our innovative design, intricate facial details can be well preserved. Comprehensive experiments and user studies demonstrate our advantages over other state-of-the-art methods in terms of visual quality, lip-sync accuracy and persona preservation. Furthermore, as a person-generic framework, PersonaTalk can achieve competitive performance as state-of-the-art person-specific methods. Project Page: https://grisoon.github.io/PersonaTalk/.

9/10/2024