Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

2404.19758

Published 5/1/2024 by Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht

⚙️

Abstract

3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.

Create account to get full access

Overview

This research paper explores a novel approach to 3D scene generation, a growing field driven by advancements in 2D generative diffusion models.
Prior methods often generate scenes by stitching newly created frames with existing geometry, relying on pre-trained monocular depth estimators to convert the 2D images into 3D.
These approaches are then typically evaluated using text-based metrics that measure the similarity between the generated images and a given prompt.
The key contributions of this work are:
1. Introducing a new depth completion model that learns the 3D fusion process, resulting in more geometrically coherent scenes.
2. Proposing a new benchmarking scheme that evaluates the quality of the scene's structure based on ground truth geometry.

Plain English Explanation

Generating 3D scenes is a rapidly evolving field of research, driven by improvements in 2D image generation techniques. Most prior work has focused on creating 3D scenes by combining newly generated 2D frames with existing 3D geometry. These methods often use pre-trained depth estimation models to convert the 2D images into 3D, and then evaluate the results based on how well the generated images match a given text description.

In this research, the authors identify two key limitations of this approach. First, they note that using a monocular depth estimator to convert the 2D images is suboptimal, as it doesn't take into account the existing geometry of the scene. To address this, the researchers introduce a new depth completion model that learns to fuse the 2D images with the 3D geometry in a more coherent way. This model is trained using a technique called 'teacher distillation' and 'self-training', which helps it better understand the 3D structure of the scene.

Second, the authors propose a new way to evaluate 3D scene generation methods, focusing on the quality of the scene's structure rather than just the similarity to a text prompt. This new benchmark measures how well the generated scenes match the ground truth 3D geometry, providing a more objective assessment of the method's ability to create plausible 3D environments.

By addressing these issues, the researchers are advancing the field of 3D scene generation, which has important applications in areas like virtual reality, video game development, and architectural design. Their work demonstrates how combining 2D image generation with 3D geometry can lead to more realistic and coherent 3D scenes.

Technical Explanation

The key technical contributions of this work are a novel depth completion model and a new benchmarking scheme for 3D scene generation.

The depth completion model is trained using a combination of 'teacher distillation' and 'self-training'. Teacher distillation involves using the outputs of a pre-trained depth estimation model as a form of supervision for the new model, helping it learn to better understand the 3D structure of the scene. Self-training then allows the model to further refine its predictions by iteratively improving on its own outputs.

This depth completion model is then used to fuse the newly generated 2D images with the existing 3D geometry, resulting in scenes with improved geometric coherence compared to prior methods that relied solely on monocular depth estimation.

To evaluate the quality of the generated 3D scenes, the authors introduce a new benchmark that measures how well the scenes match the ground truth 3D geometry, rather than just text-based similarity. This provides a more objective assessment of the method's ability to create plausible 3D environments.

Critical Analysis

The researchers have identified important limitations in prior 3D scene generation approaches and proposed novel solutions to address them. The depth completion model's use of teacher distillation and self-training is a promising approach for improving the 3D fusion process, though the specific implementation details and architectural choices are not explored in depth.

One potential area for further research could be investigating the impact of different depth estimation models or fusion strategies on the final scene quality, as discussed in 'Behind the Veil: Enhanced Indoor 3D Scene Reconstruction'. Additionally, the new benchmarking scheme, while an important contribution, may not capture all relevant aspects of scene quality, such as semantic coherence or visual plausibility.

Overall, this work represents a significant advancement in the field of 3D scene generation and highlights the importance of considering the underlying 3D geometry when creating new 3D content. The researchers have demonstrated the value of their approach through rigorous experimentation and thoughtful evaluation, setting the stage for further progress in this rapidly evolving area of research.

Conclusion

This research paper introduces two key innovations in the field of 3D scene generation. First, it presents a novel depth completion model that learns to better fuse newly generated 2D images with existing 3D geometry, resulting in more geometrically coherent scenes. Second, it proposes a new benchmarking scheme that evaluates the quality of the generated 3D scenes based on ground truth geometry, providing a more objective assessment of the methods' capabilities.

By addressing the limitations of prior 3D scene generation approaches, this work represents a significant step forward in the field. The depth completion model's use of teacher distillation and self-training, along with the new benchmarking scheme, offer valuable insights and tools for researchers and developers working on creating realistic and visually compelling 3D environments. As the field of 3D scene generation continues to evolve, this research will undoubtedly inform and inspire future advancements in this important area of study.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

Jaidev Shriram, Alex Trevithick, Lingjie Liu, Ravi Ramamoorthi

We introduce RealmDreamer, a technique for generation of general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by utilizing the state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model by conditioning on the samples from the inpainting model, giving rich geometric structure. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.

4/11/2024

cs.CV cs.AI cs.GR cs.LG

RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting

Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic

Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.

4/17/2024

cs.CV

DoubleTake: Geometry Guided Depth Estimation

Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente, Michael Firman

Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared to individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.

6/27/2024

cs.CV cs.LG

DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting

Shijie Zhou, Zhiwen Fan, Dejia Xu, Haoran Chang, Pradyumna Chari, Tejas Bharadwaj, Suya You, Zhangyang Wang, Achuta Kadambi

The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary flat (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of 3D Gaussians. In order to address invisible issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: http://dreamscene360.github.io/

4/11/2024

cs.CV cs.AI