InstructHumans: Editing Animated 3D Human Textures with Instructions

Read original: arXiv:2404.04037 - Published 4/8/2024 by Jiayin Zhu, Linlin Yang, Angela Yao

Overview

This paper introduces InstructHumans, a system that allows users to edit the textures of animated 3D human models using natural language instructions.
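According to the paper's abstract, the method distills guidance from a generative diffusion model using a modified form of Score Distillation Sampling (SDS), called SDS for Editing (SDS-E), so that edits follow the text prompt while staying consistent with the source avatar.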
InstructHumans can handle a variety of editing tasks, from changing the color and style of clothing to adding or removing accessories and tattoos.

Plain English Explanation

InstructHumans is a new technology that lets you edit the appearance of 3D human models using simple text instructions. Instead of having to use complex 3D modeling software, you can just type something like "make the shirt blue" or "add a tattoo on the arm," and the system will automatically update the model accordingly.

This is possible thanks to pretrained diffusion models, which have learned how images relate to text. The system uses your text prompt to guide such a model and repeatedly adjusts the 3D model's texture toward the requested edit, while keeping the result recognizably the same character.

The ability to edit 3D human models using natural language instructions can be really useful for a variety of applications, such as creating custom character designs for video games, experimenting with different fashion styles, or visualizing physical changes to the human body. Instead of having to manually tweak the 3D model, you can just describe what you want, and the system will take care of the rest.

Technical Explanation

According to the paper's abstract, InstructHumans builds on Score Distillation Sampling (SDS), the technique that existing text-based editing methods use to distill guidance from generative models. The authors show that naively using SDS scores is harmful to editing because they destroy consistency with the source avatar. They instead propose SDS for Editing (SDS-E), which selectively incorporates subterms of SDS across diffusion timesteps.
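The paper's abstract describes SDS-E as selectively incorporating subterms of SDS across diffusion timesteps. The sketch below illustrates that general idea only. It uses the standard classifier-free-guidance form of the SDS residual, which splits into two subterms, and a made-up gate (`example_gate`, with a hypothetical threshold `t_switch`) that decides which subterms to keep at a given timestep. The function names, the two-term split, and the gating rule are all assumptions for illustration, not the authors' actual decomposition or schedule.

```python
import numpy as np

def sds_subterms(eps_uncond, eps_cond, eps_true, guidance_scale):
    """Split the classifier-free-guided SDS residual into two subterms.

    (eps_uncond + s * (eps_cond - eps_uncond)) - eps_true
        = (eps_uncond - eps_true) + s * (eps_cond - eps_uncond)
    """
    residual = eps_uncond - eps_true                      # unconditional denoising residual
    guidance = guidance_scale * (eps_cond - eps_uncond)   # instruction-driven direction
    return residual, guidance

def example_gate(t, t_switch=0.5):
    """Hypothetical gate: keep the residual subterm only at high-noise timesteps."""
    return (t >= t_switch, True)

def sds_gradient(eps_uncond, eps_cond, eps_true, t,
                 guidance_scale=7.5, gate=None, weight=lambda t: 1.0):
    """Weighted sum of the selected subterms; gate=None gives plain SDS."""
    residual, guidance = sds_subterms(eps_uncond, eps_cond, eps_true, guidance_scale)
    use_residual, use_guidance = (True, True) if gate is None else gate(t)
    grad = np.zeros_like(eps_true, dtype=float)
    if use_residual:
        grad += residual
    if use_guidance:
        grad += guidance
    return weight(t) * grad
```

In a real pipeline the three noise arrays would come from a diffusion model evaluated on a rendered view of the avatar, and the returned gradient would be backpropagated to the texture parameters. Here they are plain arrays so the arithmetic can be checked in isolation.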

The authors further enhance SDS-E with two additions: spatial smoothness regularization and gradient-based viewpoint sampling. Together, these are reported to yield high-quality edits with sharp, high-fidelity detail.
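The paper's abstract names two additions to the editing objective: spatial smoothness regularization and gradient-based viewpoint sampling. The source gives no formulas for either, so the sketch below shows only one generic way each idea could look. It assumes the texture is stored as an H x W x C array (with H and W at least 2) and that a per-view gradient magnitude is already available. The function names and the `temperature` parameter are hypothetical, and the paper's texture representation and sampling rule may differ.

```python
import numpy as np

def smoothness_penalty(texture):
    """Mean squared difference between neighboring texels of an H x W x C array."""
    dy = texture[1:, :, :] - texture[:-1, :, :]   # vertical neighbors
    dx = texture[:, 1:, :] - texture[:, :-1, :]   # horizontal neighbors
    return float(np.mean(dy ** 2) + np.mean(dx ** 2))

def viewpoint_probabilities(grad_magnitudes, temperature=1.0, floor=1e-8):
    """Turn per-view gradient magnitudes into sampling probabilities.

    Views with larger gradients are sampled more often; `floor` keeps
    every view reachable even when its gradient is zero.
    """
    g = np.asarray(grad_magnitudes, dtype=float)
    scores = (g + floor) ** (1.0 / temperature)
    return scores / scores.sum()

def sample_viewpoint(grad_magnitudes, rng, temperature=1.0):
    """Draw one view index according to the gradient-based probabilities."""
    probs = viewpoint_probabilities(grad_magnitudes, temperature)
    return int(rng.choice(len(probs), p=probs))
```

A training loop could add `smoothness_penalty` (scaled by a small coefficient) to the editing loss and call `sample_viewpoint(mags, np.random.default_rng(0))` to pick the next camera, refreshing `mags` as gradients are observed.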

The authors report that InstructHumans significantly outperforms existing 3D editing methods, producing edits that remain consistent with the initial avatar while staying faithful to the textual instructions.

Critical Analysis

One potential limitation of InstructHumans is its dependence on the source avatar: errors or artifacts in the initial 3D model are likely to carry over into the edited result. Additionally, the system's performance may be constrained by the capabilities of the underlying diffusion model, which could struggle with highly complex or abstract editing instructions.

The authors acknowledge these challenges and suggest areas for future research, such as incorporating user feedback loops to refine the texture edits, or exploring ways to better integrate the 3D reconstruction and texture generation components of the system.

Overall, InstructHumans represents an exciting step forward in the field of 3D human appearance editing, demonstrating the potential of language-guided interfaces to democratize and simplify the creation of custom 3D content.

Conclusion

The InstructHumans system introduces a novel approach to editing the textures of animated 3D human models using natural language instructions. By adapting score distillation from a pretrained diffusion model to the editing setting, the system enables users to modify the appearance of virtual humans in a wide variety of ways, from changing clothing styles to adding accessories and tattoos.

This technology has numerous potential applications, from character design for video games and movies to fashion experimentation and visualization of physical changes. While the system has some limitations, the authors' work represents a significant advancement in the field of 3D human appearance editing, and their approach could inspire further research and development in this area.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

InstructHumans: Editing Animated 3D Human Textures with Instructions

Jiayin Zhu, Linlin Yang, Angela Yao

We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively incorporates subterms of SDS across diffusion timesteps. We further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp and high-fidelity detailing. InstructHumans significantly outperforms existing 3D editing methods, consistent with the initial avatar while faithful to the textual instructions. Project page: https://jyzhu.top/instruct-humans .

4/8/2024

InstructVid2Vid: Controllable Video Editing with Natural Language Instructions

Bosheng Qin, Juncheng Li, Siliang Tang, Tat-Seng Chua, Yueting Zhuang

We introduce InstructVid2Vid, an end-to-end diffusion-based methodology for video editing guided by human language instructions. Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion. The proposed InstructVid2Vid model modifies a pretrained image generation model, Stable Diffusion, to generate a time-dependent sequence of video frames. By harnessing the collective intelligence of disparate models, we engineer a training dataset rich in video-instruction triplets, which is a more cost-efficient alternative to collecting data in real-world scenarios. To enhance the coherence between successive frames within the generated videos, we propose the Inter-Frames Consistency Loss and incorporate it during the training process. With multimodal classifier-free guidance during the inference stage, the generated videos are able to resonate with both the input video and the accompanying instructions. Experimental results demonstrate that InstructVid2Vid is capable of generating high-quality, temporally coherent videos and performing diverse edits, including attribute editing, background changes, and style transfer. These results underscore the versatility and effectiveness of our proposed method.

5/30/2024

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

Ling Yang, Bohan Zeng, Jiaming Liu, Hong Li, Minghao Xu, Wentao Zhang, Shuicheng Yan

Diffusion models have significantly improved the performance of image editing. Existing methods realize various approaches to achieve high-quality image editing, including but not limited to text control, dragging operation, and mask-and-inpainting. Among these, instruction-based editing stands out for its convenience and effectiveness in following human instructions across diverse scenarios. However, it still focuses on simple editing operations like adding, replacing, or deleting, and falls short of understanding aspects of world dynamics that convey the realistic dynamic nature of the physical world. Therefore, this work, EditWorld, introduces a new editing task, namely world-instructed image editing, which defines and categorizes the instructions grounded by various world scenarios. We curate a new image editing dataset with world instructions using a set of large pretrained models (e.g., GPT-3.5, Video-LLava and SDXL). To enable sufficient simulation of world dynamics for image editing, EditWorld trains a model on the curated dataset and improves instruction-following ability with a designed post-edit strategy. Extensive experiments demonstrate that our method significantly outperforms existing editing methods in this new task. Our dataset and code will be available at https://github.com/YangLing0818/EditWorld

6/5/2024

Plasticine3D: 3D Non-Rigid Editing with Text Guidance by Multi-View Embedding Optimization

Yige Chen, Teng Hu, Yizhe Tang, Siyuan Chen, Ang Chen, Ran Yi

With the help of Score Distillation Sampling (SDS) and the rapid development of neural 3D representations, some methods have been proposed to perform 3D editing such as adding additional geometries, or overwriting textures. However, the generalized 3D non-rigid editing task, which requires changing both the structure (posture or composition) and appearance (texture) of the original object, remains challenging in the 3D editing field. In this paper, we propose Plasticine3D, a novel text-guided fine-grained controlled 3D editing pipeline that can perform 3D non-rigid editing with large structure deformations. Our work divides the editing process into a geometry editing stage and a texture editing stage to achieve separate control of structure and appearance. In order to maintain the details of the original object from different viewpoints, we propose a Multi-View-Embedding (MVE) Optimization strategy to ensure that the guidance model learns the features of the original object from various viewpoints. For the purpose of fine-grained control, we propose Embedding-Fusion (EF) to blend the original characteristics with the editing objectives in the embedding space, and control the extent of editing by adjusting the fusion rate. Furthermore, in order to address the issue of gradual loss of details during the generation process under high editing intensity, as well as the problem of insignificant editing effects in some scenarios, we propose Score Projection Sampling (SPS) as a replacement for score distillation sampling, which introduces additional optimization phases for editing target enhancement and original detail maintenance, leading to better editing quality. Extensive experiments demonstrate the effectiveness of our method on 3D non-rigid editing tasks.

7/10/2024