X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

Read original: arXiv:2405.00954 - Published 5/3/2024 by Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji

Overview

Introduces a progressive framework called X-Oscar for generating high-quality, text-guided, 3D animatable avatars.
Builds the avatar step by step, following a sequential Geometry->Texture->Animation paradigm that simplifies optimization.
Aims to address limitations of existing text-to-avatar approaches, such as oversaturation and low-quality output.

Plain English Explanation

The paper presents a new system called X-Oscar that makes it easier to create high-quality, animated 3D avatars from text descriptions. This is an important problem because current avatar generation methods often struggle to produce realistic and customizable characters.

X-Oscar works by building the avatar in stages rather than all at once. Starting from a text prompt, it first generates the avatar's geometry, then its texture, and finally its animation. Tackling one stage at a time simplifies the optimization, and the end result is an avatar that not only looks lifelike but can also be animated.

Compared to previous approaches, X-Oscar offers several key advantages. It gives users more fine-grained control over the avatar's appearance and behavior through the text inputs. It also produces avatars with higher visual fidelity and more natural animations. This makes X-Oscar a powerful tool for applications like virtual reality, games, and online communication, where realistic and expressive avatars are essential.

Technical Explanation

According to the paper's abstract, X-Oscar follows a sequential Geometry->Texture->Animation paradigm. Rather than optimizing every aspect of the avatar at once, it generates the avatar step by step from a text prompt: geometry first, then texture, then animation. The authors describe this staged design as a way to simplify optimization, and the final stage is what makes the generated avatar animatable rather than a static 3D model.
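
The abstract describes a sequential Geometry->Texture->Animation paradigm that simplifies optimization through step-by-step generation. The following is a toy sketch of that idea only, not the authors' implementation: the parameter groups, the quadratic stand-in objective, and the helper names `optimize_stage` and `toy_grad` are all hypothetical, and it assumes earlier stages stay fixed while later ones are optimized, which the abstract does not spell out.

```python
import numpy as np

def optimize_stage(params, trainable, grad_fn, steps=200, lr=0.1):
    """Gradient descent that only updates the parameter groups named in `trainable`."""
    for _ in range(steps):
        grads = grad_fn(params)
        for name in trainable:
            params[name] = params[name] - lr * grads[name]
    return params

# Toy stand-in objective (hypothetical): each group is pulled toward a fixed target.
TARGETS = {
    "geometry": np.array([1.0, 2.0]),
    "texture": np.array([0.5]),
    "animation": np.array([-1.0]),
}

def toy_grad(params):
    """Gradient of sum((p - target)^2) for every parameter group."""
    return {k: 2.0 * (params[k] - TARGETS[k]) for k in params}

params = {k: np.zeros_like(v) for k, v in TARGETS.items()}
history = []  # snapshot of all groups after each stage

# Progressive schedule: one stage at a time, earlier stages left untouched.
for stage in ["geometry", "texture", "animation"]:
    params = optimize_stage(params, [stage], toy_grad)
    history.append({k: v.copy() for k, v in params.items()})
```

After the first stage only the geometry group has moved; the texture and animation groups are still at their initial values, which is the sense in which each stage faces a smaller optimization problem.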

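The abstract also names two techniques aimed at output quality. Adaptive Variational Parameter (AVP) represents the avatar as an adaptive distribution during training in order to tackle oversaturation. Avatar-aware Score Distillation Sampling (ASDS) incorporates avatar-aware noise into rendered images to improve generation quality during optimization.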
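
The abstract states that Adaptive Variational Parameter (AVP) represents avatars as an adaptive distribution during training. One common way to treat a parameter as a distribution is the Gaussian reparameterization trick, sketched below. This is an assumption for illustration: the abstract does not say which distribution X-Oscar uses or how it adapts, and the names `sample_variational_parameter`, `mean`, and `log_std` are hypothetical.

```python
import numpy as np

def sample_variational_parameter(mean, log_std, rng):
    """Draw one sample of a parameter modeled as N(mean, exp(log_std)^2).

    Writing the sample as mean + std * eps keeps it a differentiable
    function of mean and log_std when used in an autodiff framework
    (the reparameterization trick).
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(log_std) * eps

rng = np.random.default_rng(0)
mean = np.zeros(4)          # hypothetical avatar parameter with 4 values
log_std = np.full(4, -2.0)  # standard deviation = exp(-2)

# Each training step would see a slightly different sample of the parameter.
samples = np.stack(
    [sample_variational_parameter(mean, log_std, rng) for _ in range(1000)]
)
print(samples.mean(), samples.std())
```

In a real training loop the mean and spread would be learned alongside the rest of the model; here they are fixed so that the sampling step can be inspected on its own.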

Critical Analysis

The paper provides a comprehensive and promising solution for generating high-quality, text-guided, 3D animatable avatars. However, it's important to note a few potential limitations and areas for further research:

The authors acknowledge that the text-to-image generation component may still struggle with producing highly detailed or diverse facial features. Continued advancements in text-to-image models could help address this.
The 3D animation module relies on several existing techniques, which may have their own limitations in terms of realism, flexibility, or computational efficiency. Integrating more advanced 3D modeling and animation methods could further improve the overall quality and performance of the system.
Animation is the final stage of the pipeline, but more expressive, emotion-driven control over the generated avatars could be an interesting area for future exploration.
The evaluation of the system is primarily based on subjective user feedback and qualitative assessments. Incorporating more objective performance metrics and benchmarks could provide a more comprehensive understanding of the system's capabilities.

Despite these potential areas for improvement, the X-Oscar framework represents a significant step forward in the field of text-guided 3D avatar generation, with promising applications in virtual reality, gaming, and online communication.

Conclusion

The X-Oscar framework introduced in this paper provides a progressive approach to generating high-quality, text-guided, 3D animatable avatars. By integrating state-of-the-art text-to-image and 3D modeling techniques, the system offers users greater control and flexibility in creating realistic and expressive virtual characters.

Generating avatars directly from text descriptions could change how people interact and communicate in virtual environments. X-Oscar shows how much progress is possible when recent advances in generative models and computer graphics are combined.

This summary was produced with help from an AI and may contain inaccuracies; consult the original paper (arXiv:2405.00954) for the authoritative details.

Related Papers

X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji

Recent advancements in automatic 3D avatar generation guided by text have made significant progress. However, existing methods have limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), representing avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/.

5/3/2024

STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting

Zenghao Chai, Chen Tang, Yongkang Wong, Mohan Kankanhalli

The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an optimization-by-animation paradigm has several drawbacks. (1) For pose-agnostic optimization, the rendered images in canonical pose for naive Score Distillation Sampling (SDS) exhibit domain gap and cannot preserve view-consistency using only T2I priors, and (2) For post hoc animation, simply applying the source motions to target 3D avatars yields translation artifacts and misalignment. To address these issues, we propose Skeleton-aware Text-based 4D Avatar generation with in-network motion Retargeting (STAR). STAR considers the geometry and skeleton differences between the template mesh and target avatar, and corrects the mismatched source motion by resorting to the pretrained motion retargeting techniques. With the informatively retargeted and occlusion-aware skeleton, we embrace the skeleton-conditioned T2I and text-to-video (T2V) priors, and propose a hybrid SDS module to coherently provide multi-view and frame-consistent supervision signals. Hence, STAR can progressively optimize the geometry, texture, and motion in an end-to-end manner. The quantitative and qualitative experiments demonstrate our proposed STAR can synthesize high-quality 4D avatars with vivid animations that align well with the text description. Additional ablation studies show the contributions of each component in STAR. The source code and demos are available at: https://star-avatar.github.io.

6/10/2024

Barbie: Text to Barbie-Style 3D Avatars

Xiaokun Sun, Zhenyu Zhang, Ying Tai, Qian Wang, Hao Tang, Zili Yi, Jian Yang

Recent advances in text-guided 3D avatar generation have made substantial progress by distilling knowledge from diffusion models. Despite the plausible generated appearance, existing methods cannot achieve fine-grained disentanglement or high-fidelity modeling between inner body and outfit. In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse and high-quality Barbie-like garments and accessories. Instead of relying on a holistic model, Barbie achieves fine-grained disentanglement on avatars by semantic-aligned separated models for human body and outfits. These disentangled 3D representations are then optimized by different expert models to guarantee the domain-specific fidelity. To balance geometry diversity and reasonableness, we propose a series of losses for template-preserving and human-prior evolving. The final avatar is enhanced by unified texture refinement for superior texture consistency. Extensive experiments demonstrate that Barbie outperforms existing methods in both dressed human and outfit generation, supporting flexible apparel combination and animation. The code will be released for research purposes. Our project page is: https://xiaokunsun.github.io/Barbie.github.io/.

9/9/2024

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian

Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.

5/27/2024