MegaScenes: Scene-Level View Synthesis at Scale

Read original: arXiv:2406.11819 - Published 8/23/2024 by Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, Noah Snavely
Total Score

0

MegaScenes: Scene-Level View Synthesis at Scale

Sign in to get full access

or

If you already have an account, we'll log you in

Overview

  • Novel view synthesis of complex real-world scenes at scale
  • Pose-conditioned diffusion models to generate realistic new views from a single input image
  • Large-scale dataset of diverse Internet photo collections to train the model

Plain English Explanation

This research presents a new approach for generating realistic new views of complex real-world scenes from a single input image. The key innovation is the use of pose-conditioned diffusion models, which can learn to synthesize novel views of a scene based on the camera pose relative to the input image.

The researchers collected a large-scale dataset of diverse Internet photo collections, called MegaScenes, to train their model. This dataset captures the wide variety of scenes and viewpoints that exist in the real world. By learning from this diverse data, the model can generate plausible new views of scenes it has never seen before.

The ability to synthesize novel views of a scene from a single input image has many potential applications, such as PolyOculus: Simultaneous Multi-View Image-Based Novel View Synthesis, We-GS-Wild: Efficient 3D Gaussian Representation, Incremental Joint Learning of Depth, Pose, and Implicit Scene, SGD: Street View Synthesis with Gaussian Splatting Diffusion, and Self-Calibrating 4D: Novel View Synthesis from. This could enable new and more immersive experiences in fields like virtual reality, digital content creation, and autonomous navigation.

Technical Explanation

The core of the MegaScenes system is a pose-conditioned diffusion model that can generate novel views of a scene given a single input image and the desired camera pose. Diffusion models are a type of generative model that learn to transform random noise into realistic images by iteratively adding and removing noise.

In the MegaScenes model, the diffusion process is conditioned on the camera pose relative to the input image. This allows the model to learn the correspondence between the input view and the desired output view, and generate a coherent new image that matches the specified pose.

The researchers trained this model on the large-scale MegaScenes dataset, which contains over 100 million diverse Internet photos spanning a wide range of real-world scenes and viewpoints. By learning from this vast and varied data, the model can generalize to synthesize plausible new views of scenes it has never encountered before.

Critical Analysis

The MegaScenes approach represents a significant advance in the field of novel view synthesis, demonstrating the ability to generate realistic new views of complex real-world scenes at a scale not previously achieved. However, the paper also acknowledges some key limitations and areas for future work.

One notable limitation is that the model is trained on 2D images, rather than 3D scene representations. This means the generated views may not fully capture the 3D structure and occlusions of the original scene. Incorporating 3D information, as explored in Incremental Joint Learning of Depth, Pose, and Implicit Scene, could potentially improve the fidelity of the synthesized views.

Additionally, the paper suggests that the model's performance may degrade when faced with highly complex or unusual scenes that are underrepresented in the training data. Addressing this limitation and improving the model's ability to generalize to a wider range of scenes is an important area for future research.

Conclusion

The MegaScenes research represents a significant advance in the field of novel view synthesis, demonstrating the ability to generate realistic new views of complex real-world scenes at an unprecedented scale. By leveraging pose-conditioned diffusion models and a large-scale dataset of diverse Internet photos, the system can synthesize plausible new perspectives of scenes it has never encountered before.

While the current approach has some limitations, such as the lack of 3D scene information and potential difficulty with complex or unusual scenes, the insights and techniques presented in this paper could have far-reaching implications for applications like virtual reality, digital content creation, and autonomous navigation. As the field of view synthesis continues to evolve, the MegaScenes model and its successors could play a crucial role in enabling more immersive and realistic experiences for users.



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

MegaScenes: Scene-Level View Synthesis at Scale
Total Score

0

MegaScenes: Scene-Level View Synthesis at Scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, Noah Snavely

Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at https://megascenes.github.io.

Read more

8/23/2024

🛸

Total Score

0

PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

Jason J. Yu, Tristan Aumentado-Armstrong, Fereshteh Forghani, Konstantinos G. Derpanis, Marcus A. Brubaker

This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. Our approach is not limited to generating a single image at a time and can condition on a variable number of views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate our model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.

Read more

7/29/2024

Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance
Total Score

0

Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin

Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes generated from new viewpoints and struggle with complex environments. They often require extensive 3D data for training, lacking generalization beyond training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without tedious fine-tuning, but lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels in handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

Read more

8/13/2024

WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections
Total Score

0

WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

Yuze Wang, Junyi Wang, Yue Qi

Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has shown promise for photorealistic and real-time NVS of static scenes. Building on 3DGS, we propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections. Our key innovation is a residual-based spherical harmonic coefficients transfer module that adapts 3DGS to varying lighting conditions and photometric post-processing. This lightweight module can be pre-computed and ensures efficient gradient propagation from rendered images to 3D Gaussian attributes. Additionally, we observe that the appearance encoder and the transient mask predictor, the two most critical parts of NVS from unconstrained photo collections, can be mutually beneficial. We introduce a plug-and-play lightweight spatial attention module to simultaneously predict transient occluders and latent appearance representation for each image. After training and preprocessing, our method aligns with the standard 3DGS format and rendering pipeline, facilitating seamlessly integration into various 3DGS applications. Extensive experiments on diverse datasets show our approach outperforms existing approaches on the rendering quality of novel view and appearance synthesis with high converge and rendering speed.

Read more

6/5/2024