RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting

2404.10765

Published 4/17/2024 by Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic

cs.CV

Abstract

Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability is still posing a challenge. In this work, we propose an approach for 3D scene inpainting -- the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore not only enable high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in a lower variance of score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse view reconstruction.

Overview

This paper presents RefFusion, a 3D scene inpainting method built on an image inpainting diffusion model that is adapted to a reference image.
A single reference view specifies the desired content, giving explicit control over what fills the inpainted region.
The adapted (personalized) 2D model then guides the 3D scene through a score distillation objective, which the authors credit for sharper details.

Plain English Explanation

RefFusion is a new AI method that can fill in or replace parts of 3D scenes. It uses a reference image as a guide, so the filled-in region is realistic, consistent with the rest of the scene, and matches what the user asked for.

Traditionally, 3D scene inpainting (the process of replacing parts of a reconstructed 3D scene) has been challenging because many different completions are plausible, and a method must produce one that stays consistent from every viewpoint. RefFusion tackles this by taking an image inpainting diffusion model and adapting it to a reference image of the desired result.

The key idea is that a reference image tells the model what the inpainted region should look like, so it can fill in the missing parts in a way that matches the style and content of the existing scene. This is like a human artist using a reference image to help them sketch a realistic scene.

The method starts with an incomplete 3D scene and refines it step by step, using the adapted diffusion model to judge rendered views and nudge the scene toward the reference. Because the adapted model is focused on this one scene, its guidance is less noisy, which yields sharper, more coherent 3D completions.

The authors report state-of-the-art results for object removal and also demonstrate object insertion, scene outpainting, and sparse view reconstruction. Potential applications include virtual reality, game development, and 3D content creation.

Technical Explanation

The core of RefFusion is an image inpainting diffusion model that is personalized to a given reference view and then used to guide the optimization of the 3D scene. The method takes as input a reconstructed 3D scene with a region to replace and a reference image, and outputs an inpainted 3D scene.

The key technical components are:

Image Inpainting Diffusion Model: This is a pre-trained 2D model that fills masked image regions through a diffusion process. It learns to gradually transform random noise into coherent image content by following a learned sequence of denoising steps.
Multi-Scale Personalization: The diffusion model is adapted to the reference view at multiple scales. According to the abstract, this shifts the prior distribution toward the target scene and lowers the variance of the score distillation objective.
Optimization: The 3D scene is optimized with a score distillation objective from the personalized model, combined with reconstruction and adversarial losses. Rendered views of the scene are refined iteratively until the inpainted region agrees with the reference.
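
The abstract describes a multi-scale personalization of the inpainting diffusion model to the reference view. One plausible way to prepare data for such a step is to sample crops of the reference image at several scales. The sketch below does only that sampling; the function name, the scale fractions, and the crop counts are assumptions for illustration, not the authors' procedure.

```python
import random


def sample_multiscale_crops(width, height, scales=(1.0, 0.5, 0.25),
                            crops_per_scale=4, seed=0):
    """Return (left, top, right, bottom) square crop boxes at several scales.

    Each scale is a fraction of the shorter image side. The scales and
    counts are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    short_side = min(width, height)
    boxes = []
    for scale in scales:
        size = max(1, round(short_side * scale))
        for _ in range(crops_per_scale):
            # Pick a top-left corner so the crop stays inside the image.
            left = rng.randint(0, width - size)
            top = rng.randint(0, height - size)
            boxes.append((left, top, left + size, top + size))
    return boxes


# Hypothetical 640x480 reference image.
boxes = sample_multiscale_crops(width=640, height=480)
```

Large crops expose the model to the global layout of the reference, while small crops emphasize local texture; fine-tuning on both is one reading of "multi-scale" here.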

The experiments show that RefFusion achieves state-of-the-art results for object removal while keeping a high degree of control over the inpainted content through the reference image. The authors also demonstrate the same formulation on object insertion, scene outpainting, and sparse view reconstruction.
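
The abstract attributes the sharper details to a lower-variance score distillation objective. As a rough intuition only, the toy below runs score distillation on a few "pixels" with an identity renderer and a one-dimensional Gaussian prior whose exact noise prediction is known in closed form. Everything here (the Gaussian prior, the function names, the step size, the unit weighting) is an assumption made for illustration; a real system would use a neural inpainting diffusion model and a differentiable 3D renderer.

```python
import math
import random


def toy_noise_prediction(x_t, alpha, prior_mean, prior_std):
    """Exact noise prediction for a 1-D Gaussian prior N(prior_mean, prior_std^2).

    Stand-in for a diffusion model; alpha is the signal level of the noised
    sample x_t = sqrt(alpha) * x + sqrt(1 - alpha) * eps.
    """
    marginal_var = alpha * prior_std ** 2 + (1.0 - alpha)
    return (math.sqrt(1.0 - alpha)
            * (x_t - math.sqrt(alpha) * prior_mean) / marginal_var)


def sds_gradient(pixel, alpha, eps, prior_mean, prior_std):
    """Score distillation gradient (predicted noise minus injected noise)."""
    x_t = math.sqrt(alpha) * pixel + math.sqrt(1.0 - alpha) * eps
    return toy_noise_prediction(x_t, alpha, prior_mean, prior_std) - eps


def distill(prior_means, prior_std, steps=3000, lr=0.01, seed=0):
    """Optimize pixels by following noisy score distillation gradients."""
    rng = random.Random(seed)
    pixels = [0.0] * len(prior_means)
    for _ in range(steps):
        alpha = rng.uniform(0.02, 0.98)  # random noise level per step
        for i, mean in enumerate(prior_means):
            eps = rng.gauss(0.0, 1.0)
            pixels[i] -= lr * sds_gradient(pixels[i], alpha, eps, mean, prior_std)
    return pixels


def mean_squared_gradient(prior_std, samples=2000, seed=1):
    """Gradient noise measured at the optimum (pixel equals the prior mean)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        alpha = rng.uniform(0.02, 0.98)
        eps = rng.gauss(0.0, 1.0)
        total += sds_gradient(0.5, alpha, eps, 0.5, prior_std) ** 2
    return total / samples


target = [0.2, 0.8, 0.5]
result = distill(target, prior_std=0.1)               # tight, "personalized" prior
broad_noise = mean_squared_gradient(prior_std=1.0)    # broad, generic prior
tight_noise = mean_squared_gradient(prior_std=0.1)
```

In this toy, the pixels settle near the prior mean, and the tighter prior yields a smaller gradient noise than the broad one. That loosely mirrors the abstract's claim about personalization but is not the paper's derivation.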

Critical Analysis

One limitation of RefFusion is that it relies on a suitable reference image to guide the inpainting process. In some cases, producing or finding an appropriate reference may be challenging, especially for complex or unusual scenes. Performance may degrade if the reference does not closely match the content and style of the target scene.

Additionally, the diffusion-based approach used in RefFusion can be computationally expensive, as it requires iteratively refining the scene completion over many steps. This may limit the practicality of the model for real-time applications or resource-constrained environments.

Further research could explore ways to make the reference adaptation more robust, such as learning to generate or retrieve suitable references automatically. Optimizing the diffusion process for efficiency could also improve the model's practical applicability.

Overall, RefFusion represents a promising step forward in 3D scene inpainting, demonstrating the potential of leveraging reference information to guide the generation of coherent and realistic scene completions.

Conclusion

In this paper, the authors introduce RefFusion, a reference-adapted diffusion model for 3D scene inpainting. The key innovation is personalizing an image inpainting diffusion model to a reference view, so that it guides the 3D scene toward more coherent and realistic completions.

The technical approach adapts a pre-trained image inpainting diffusion model to the reference view at multiple scales and uses the adapted model in a score distillation objective to optimize the 3D scene. The authors report state-of-the-art results for object removal.

While the reliance on a reference image and the computational cost of the diffusion process are potential limitations, RefFusion represents an important advance in 3D scene understanding and manipulation. The ability to generate realistic 3D scene completions has numerous applications in areas such as virtual reality, game development, and 3D content creation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Diffusion-based image inpainting with internal learning

Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson

Diffusion models are now the undisputed state-of-the-art for image generation and image restoration. However, they require large amounts of computational power for training and inference. In this paper, we propose lightweight diffusion models for image inpainting that can be trained on a single image, or a few images. We show that our approach competes with large state-of-the-art models in specific cases. We also show that training a model on a single image is particularly relevant for image acquisition modalities that differ from the RGB images of standard learning databases. We show results in three different contexts: texture images, line drawing images, and materials BRDF, for which we achieve state-of-the-art results in terms of realism, with a computational load that is greatly reduced compared to concurrent methods.

6/7/2024

cs.CV

Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting

Paul Engstler, Andrea Vedaldi, Iro Laina, Christian Rupprecht

3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.

5/1/2024

cs.CV

FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image

Rupayan Mallick, Amr Abdalla, Sarah Adel Bargal

We present FaithFill, a diffusion-based inpainting object completion approach for realistic generation of missing object parts. Typically, multiple reference images are needed to achieve such realistic generation, otherwise the generation would not faithfully preserve shape, texture, color, and background. In this work, we propose a pipeline that utilizes only a single input reference image having varying lighting, background, object pose, and/or viewpoint. The single reference image is used to generate multiple views of the object to be inpainted. We demonstrate that FaithFill produces faithful generation of the object's missing parts, together with background/scene preservation, from a single reference image. This is demonstrated through standard similarity metrics, human judgement, and GPT evaluation. Our results are presented on the DreamBooth dataset, and a novel proposed dataset.

6/13/2024

cs.CV cs.AI cs.LG

Semantically Consistent Video Inpainting with Conditional Diffusion Models

Dylan Green, William Harvey, Saeid Naderiparizi, Matthew Niedoba, Yunpeng Liu, Xiaoxuan Liang, Jonathan Lavington, Ke Zhang, Vasileios Lioutas, Setareh Dabiri, Adam Scibior, Berend Zwartsenberg, Frank Wood

Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle with tasks that require the synthesis of novel content that is not present in other frames. In this paper we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of using a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.

5/2/2024

cs.CV cs.LG