MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Read original: arXiv:2408.10605 - Published 8/22/2024 by Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang

Overview

The paper introduces MUSES, a generic AI system for 3D-controllable image generation from user queries.
MUSES addresses the challenge of creating images with multiple objects and complex spatial relationships in 3D.
The system uses a progressive workflow with three key components: Layout Manager, Model Engineer, and Image Artist.
The authors also construct a new benchmark, T2I-3DisBench, which describes diverse 3D image scenes with 50 detailed prompts.
Experiments show MUSES outperforms recent strong competitors such as DALL-E 3 and Stable Diffusion 3 on both T2I-CompBench and T2I-3DisBench.

Plain English Explanation

The paper presents a new AI system called MUSES that can create images with 3D-controllable objects and complex spatial relationships. Most existing text-to-image generation methods struggle with this task, but MUSES addresses it by breaking the process into three key steps:

Layout Manager: This component takes the user's text prompt, plans a 2D layout of the objects, and lifts that layout into 3D.
Model Engineer: This component acquires the 3D models of the objects and calibrates them to fit the layout.
Image Artist: This component renders the 3D scene into a final 2D image.

By mimicking how human professionals work together, this multi-modal pipeline allows MUSES to automatically generate images with precise 3D control, starting from just a text description.

The authors also recognized that existing benchmarks lacked detailed descriptions of complex 3D scenes, so they created a new benchmark called T2I-3DisBench to fill this gap. Evaluations show MUSES outperforms recent strong competitors like DALL-E 3 and Stable Diffusion 3 on both this new benchmark and existing ones.

Technical Explanation

The core of MUSES is a progressive workflow with three key components:

Layout Manager: This module takes the text prompt and plans a 2D layout of the objects, determining their positions and sizes, then lifts that layout into 3D (the paper calls this 2D-to-3D layout lifting). It uses a language model to interpret the spatial relationships described in the prompt.
Model Engineer: This module acquires 3D models of the required objects and calibrates them to fit the layout. It retrieves appropriate 3D models from a database, then scales, rotates, and positions them in 3D space.
Image Artist: This final module handles 3D-to-2D rendering: it turns the assembled 3D scene into the final 2D image.
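
The geometry of moving between a 2D layout and a 3D scene can be illustrated with a simple pinhole camera model. The following is a minimal sketch and not the authors' implementation: the `Camera` class, its focal length and image size, and the functions `lift_to_3d` and `project_to_2d` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Camera:
    """Hypothetical pinhole camera; focal length and image size are assumed."""
    focal: float = 500.0  # focal length in pixels
    width: int = 1024
    height: int = 1024

    @property
    def cx(self) -> float:
        # Principal point is assumed to sit at the image center.
        return self.width / 2.0

    @property
    def cy(self) -> float:
        return self.height / 2.0


def lift_to_3d(cam: Camera, u: float, v: float, depth: float) -> Tuple[float, float, float]:
    """Unproject pixel (u, v) at the given depth into camera-space (x, y, z)."""
    x = (u - cam.cx) * depth / cam.focal
    y = (v - cam.cy) * depth / cam.focal
    return (x, y, depth)


def project_to_2d(cam: Camera, point: Tuple[float, float, float]) -> Tuple[float, float]:
    """Project a camera-space point back onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (cam.focal * x / z + cam.cx, cam.focal * y / z + cam.cy)


cam = Camera()
# Center of a 2D layout box, placed 4 units away from the camera.
center_3d = lift_to_3d(cam, u=712.0, v=512.0, depth=4.0)
# Projecting the lifted point recovers the original pixel location.
round_trip = project_to_2d(cam, center_3d)
```

Lifting needs a depth value for every object, which a 2D layout alone does not provide; that missing dimension is what a 3D layout stage has to supply.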

By integrating these three components, MUSES can automatically generate images with precise 3D control, bridging natural language, 2D image generation, and 3D world modeling.
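
The three-stage hand-off can be sketched as a chain of functions. This is a toy illustration under stated assumptions, not the MUSES code: every name here (`PlacedObject`, `layout_manager`, `model_engineer`, `image_artist`, `TOY_CATALOG`, the file paths) is hypothetical, the planner is a string split rather than a language model, and the "renderer" only emits an ordered list of draw calls.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PlacedObject:
    name: str
    position: Tuple[float, float, float]  # (x, y, z) in scene units
    model_id: str = ""  # filled in by the model-engineer stage


# Hypothetical catalog mapping object names to 3D model files.
TOY_CATALOG: Dict[str, str] = {
    "cat": "models/cat.obj",
    "chair": "models/chair.obj",
}


def layout_manager(prompt: str) -> List[PlacedObject]:
    """Stand-in planner: splits the prompt on ' and ' and spaces objects out."""
    names = [part.strip() for part in prompt.split(" and ")]
    return [
        PlacedObject(name=n, position=(2.0 * i, 0.0, 5.0 + i))
        for i, n in enumerate(names)
    ]


def model_engineer(objects: List[PlacedObject], catalog: Dict[str, str]) -> List[PlacedObject]:
    """Stand-in acquisition: looks each object up in the catalog."""
    for obj in objects:
        obj.model_id = catalog.get(obj.name, "missing")
    return objects


def image_artist(objects: List[PlacedObject]) -> List[str]:
    """Stand-in renderer: orders draw calls from farthest to nearest."""
    ordered = sorted(objects, key=lambda o: o.position[2], reverse=True)
    return [f"draw {o.model_id} at {o.position}" for o in ordered]


def run_pipeline(prompt: str, catalog: Dict[str, str] = TOY_CATALOG) -> List[str]:
    return image_artist(model_engineer(layout_manager(prompt), catalog))


draw_calls = run_pipeline("cat and chair")
```

Each stage consumes only the previous stage's output, so the intermediate layout and model assignments can be inspected before anything is rendered.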

The authors also introduce a new benchmark, T2I-3DisBench, which provides detailed descriptions of complex 3D image scenes. This fills a gap in existing benchmarks and allows for more comprehensive evaluation of 3D text-to-image generation systems.

Critical Analysis

The paper makes a valuable contribution by addressing the challenge of generating images with multiple objects and complex 3D spatial relationships. The progressive workflow of MUSES is an interesting approach that mimics how human professionals collaborate.

However, several limitations and directions for further research remain open. For example, it is unclear how well MUSES would handle highly abstract or fantastical prompts, or whether the system could be extended to generate videos or animations.

Additionally, any reliance on a pre-existing 3D model database could limit the system's versatility and its ability to handle novel or unique objects. Integrating 3D object generation or reconstruction into the pipeline is one way to address this.

Overall, the research represents an important step forward in text-to-image generation, but there are opportunities to further expand the capabilities and robustness of the system.

Conclusion

The paper introduces MUSES, a generic AI system for 3D-controllable image generation from text prompts. By breaking the process into a progressive workflow with Layout Manager, Model Engineer, and Image Artist components, MUSES can effectively create images with multiple objects and complex 3D spatial relationships.

The authors also construct a new benchmark, T2I-3DisBench, to evaluate 3D image scene generation, addressing a gap in existing evaluation tools. Experiments show MUSES outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3, demonstrating its potential to bridge natural language, 2D image generation, and 3D world modeling.

This research represents a significant advance in text-to-image generation, paving the way for more immersive and controllable visual outputs from natural language inputs.
