Fast View Synthesis of Casual Videos with Soup-of-Planes

Read original: arXiv:2312.02135 - Published 7/22/2024 by Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu

Overview

Novel view synthesis from a monocular video is challenging due to scene dynamics and lack of parallax.
Existing methods using implicit neural radiance fields are slow to train and render.
This paper proposes an efficient approach to synthesize high-quality novel views from a monocular video.

Plain English Explanation

This paper presents a new method for synthesizing novel views from a single video. Generating realistic novel views from a video is difficult because scenes often contain dynamic elements that change over time, and because casual videos often lack parallax (the apparent shift of objects when seen from different camera positions), which leaves little information about the 3D structure of the scene.
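To make the parallax point concrete, here is a minimal sketch using the standard pinhole relation: a sideways camera translation B shifts a point at depth Z by roughly f * B / Z pixels. The focal length, baselines, and depths below are illustrative assumptions, not values from the paper.

```python
def disparity_px(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Image shift (in pixels) of a point at depth_m when a pinhole
    camera translates sideways by baseline_m."""
    return focal_px * baseline_m / depth_m


focal_px = 1000.0  # assumed focal length in pixels (illustrative)
for baseline_m in (0.01, 0.5):  # nearly static camera vs. clearly moving camera
    near = disparity_px(focal_px, baseline_m, 2.0)   # object 2 m away
    far = disparity_px(focal_px, baseline_m, 10.0)   # object 10 m away
    # The near/far difference is the cue that separates depths.
    print(f"baseline {baseline_m} m: near-far shift difference = {near - far:.1f} px")
```

With a small baseline the near and far points move by almost the same few pixels, so depth is hard to recover from the video alone.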

Previous approaches have used implicit neural radiance fields to model the scene, but these are slow to train and render. Instead, this paper takes an "explicit" approach, where the static and dynamic parts of the scene are represented separately.

The static scene is modeled using an extended plane-based representation, which captures view-dependent effects and complex surface geometry. The dynamic content is represented as per-frame point clouds, which are efficient but can lead to some minor inconsistencies over time.

The key insight is that these small inconsistencies are masked by the motion in the scene, allowing high-quality novel views to be rendered efficiently in real time. Experiments show that this approach achieves quality comparable to state-of-the-art methods while being 100 times faster to train.

Technical Explanation

The core of the proposed method is a hybrid video representation that treats the static and dynamic components of the scene separately.

For the static scene, an extended plane-based representation is used. This represents the 3D structure of the environment as a set of planar surfaces, augmented with spherical harmonics to capture view-dependent lighting effects and displacement maps to model non-planar complex geometry.
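The two augmentations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single fronto-parallel plane, a displacement map that offsets depth along the camera z-axis, and degree-1 real spherical harmonics with one common sign convention. The function names `sh_color` and `displaced_points` are hypothetical.

```python
import numpy as np

# Real spherical-harmonic basis constants for degrees 0 and 1.
SH_C0 = 0.28209479177387814
SH_C1 = 0.4886025119029199


def sh_color(coeffs: np.ndarray, view_dir: np.ndarray) -> np.ndarray:
    """View-dependent color from degree-1 SH coefficients.

    coeffs: (4, 3) array, one row per basis function, one column per RGB channel.
    view_dir: (3,) viewing direction (normalized inside the function).
    """
    x, y, z = view_dir / np.linalg.norm(view_dir)
    basis = np.array([SH_C0, SH_C1 * y, SH_C1 * z, SH_C1 * x])
    return basis @ coeffs


def displaced_points(plane_depth: float, displacement: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project every texel of a fronto-parallel plane into 3D.

    The displacement map is added to the plane depth (assumed convention),
    so the surface is no longer flat. Returns an (H, W, 3) array.
    """
    h, w = displacement.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = plane_depth + displacement
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    return np.stack([x, y, z], axis=-1)


# Usage with illustrative intrinsics: a 3x3 plane at depth 2 with a bump in the middle.
K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
disp = np.zeros((3, 3))
disp[1, 1] = 0.5
pts = displaced_points(2.0, disp, K)
```

When only the first (constant) coefficient row is non-zero, `sh_color` returns the same color from every direction; the three degree-1 rows add the view-dependent variation.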

The dynamic content is represented as per-frame point clouds, which are efficient but can lead to minor temporal inconsistencies. However, the authors argue that these small inconsistencies are perceptually masked by the motion in the scene.
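A per-frame point cloud can be sketched as follows. The sketch assumes that a depth map, a color image, and a mask of dynamic pixels are already available for each frame, and that the camera follows a pinhole model; this summary does not say how those inputs are obtained. The function names are hypothetical.

```python
import numpy as np


def frame_to_point_cloud(depth, rgb, mask, K):
    """Lift the masked pixels of one frame into camera-space 3D points.

    depth: (H, W) depth map, rgb: (H, W, 3) colors, mask: (H, W) boolean
    mask of dynamic pixels, K: (3, 3) pinhole intrinsics (all assumed inputs).
    Returns (points, colors) with shapes (N, 3) and (N, 3).
    """
    v, u = np.nonzero(mask)          # pixel rows and columns of dynamic content
    z = depth[v, u]
    x = (u - K[0, 2]) / K[0, 0] * z
    y = (v - K[1, 2]) / K[1, 1] * z
    return np.stack([x, y, z], axis=-1), rgb[v, u]


def build_per_frame_clouds(depths, rgbs, masks, K):
    """One independent cloud per frame; nothing links points across frames."""
    return [frame_to_point_cloud(d, c, m, K) for d, c, m in zip(depths, rgbs, masks)]
```

Because each frame is lifted independently, the representation is cheap to build, and that independence is also the source of the temporal inconsistencies described above.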

The method quickly estimates this hybrid representation from the input video and then renders novel views in real time by compositing the static and dynamic components. Key to this efficiency is the explicit nature of the representation, in contrast to the implicit neural radiance fields used in previous state-of-the-art methods, which are slow to train and render.
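The compositing step can be sketched with the standard alpha "over" operator. The exact blending rule is not described in this summary, so this is an assumption: the dynamic layer is rendered with a per-pixel coverage value and placed over the static render. The function name `composite_over` is hypothetical.

```python
import numpy as np


def composite_over(static_rgb, dyn_rgb, dyn_alpha):
    """Blend a rendered dynamic layer over a rendered static layer.

    static_rgb, dyn_rgb: (H, W, 3) images for the same novel view.
    dyn_alpha: (H, W) coverage of the dynamic layer in [0, 1]
    (assumed to come from rasterizing the point cloud).
    """
    a = dyn_alpha[..., None]                      # broadcast over RGB channels
    return a * dyn_rgb + (1.0 - a) * static_rgb


# Usage: a 1x2 image where the dynamic layer covers only the first pixel.
static = np.zeros((1, 2, 3))
dynamic = np.ones((1, 2, 3))
alpha = np.array([[1.0, 0.0]])
out = composite_over(static, dynamic, alpha)
```

Both layers are rendered independently for the target viewpoint, so the blend is a cheap per-pixel operation.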

The authors demonstrate that their approach can synthesize high-quality novel views from in-the-wild videos, with quality comparable to recent state-of-the-art methods, while being 100x faster in training and enabling real-time rendering.

Critical Analysis

The paper presents an efficient approach to novel view synthesis from monocular videos that addresses key limitations of prior work. However, a few caveats and areas for further research remain:

The use of per-frame point clouds for dynamic content may lead to some residual temporal inconsistencies, even if these are perceptually masked. Exploring more temporally coherent dynamic representations could further improve quality.
The paper focuses on static and dynamic scene components, but does not explicitly model occlusions between them. Handling occlusions more robustly could enhance the realism of the synthesized novel views.
While the method is efficient compared to implicit neural radiance field approaches, the authors do not provide a detailed comparison of memory and computational requirements. Exploring ways to further streamline the representation and rendering could unlock even more real-world applications.

Overall, this paper presents a promising direction for efficient and high-quality novel view synthesis from monocular videos, and the ideas could inspire future research in this area.

Conclusion

This paper introduces an efficient approach for synthesizing high-quality novel views from a monocular video. By treating the static and dynamic components of the scene separately, the method can quickly estimate a hybrid video representation and render novel views in real time, overcoming the limitations of previous state-of-the-art methods based on implicit neural radiance fields.

The key technical innovations are the extended plane-based representation for the static scene and the use of per-frame point clouds for the dynamic content. While the latter can lead to minor temporal inconsistencies, the authors show that these are perceptually masked by the scene motion.

Experiments demonstrate that this approach achieves quality comparable to recent state-of-the-art methods while being 100 times faster in training and enabling real-time rendering. This could allow novel view synthesis to be used in a wider range of real-world applications, from virtual reality to photo editing.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Fast View Synthesis of Casual Videos with Soup-of-Planes

Yao-Chih Lee, Zhoutong Zhang, Kevin Blackburn-Matzen, Simon Niklaus, Jianming Zhang, Jia-Bin Huang, Feng Liu

Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render. This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. We treat static and dynamic video content separately. Specifically, we build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video. Our plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. We opt to represent the dynamic content as per-frame point clouds for efficiency. While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. We develop a method to quickly estimate such a hybrid video representation and render novel views in real time. Our experiments show that our method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100x faster in training and enabling real-time rendering.

7/22/2024

A Construct-Optimize Approach to Sparse View Synthesis without Camera Pose

Kaiwen Jiang, Yang Fu, Mukund Varma T, Yash Belhe, Xiaolong Wang, Hao Su, Ravi Ramamoorthi

Novel view synthesis from a sparse set of input images is a challenging problem of great practical interest, especially when camera poses are absent or inaccurate. Direct optimization of camera poses and usage of estimated depths in neural radiance field algorithms usually do not produce good results because of the coupling between poses and depths, and inaccuracies in monocular depth estimation. In this paper, we leverage the recent 3D Gaussian splatting method to develop a novel construct-and-optimize method for sparse view synthesis without camera poses. Specifically, we construct a solution progressively by using monocular depth and projecting pixels back into the 3D world. During construction, we optimize the solution by detecting 2D correspondences between training views and the corresponding rendered images. We develop a unified differentiable pipeline for camera registration and adjustment of both camera poses and depths, followed by back-projection. We also introduce a novel notion of an expected surface in Gaussian splatting, which is critical to our optimization. These steps enable a coarse solution, which can then be low-pass filtered and refined using standard optimization methods. We demonstrate results on the Tanks and Temples and Static Hikes datasets with as few as three widely-spaced views, showing significantly better quality than competing methods, including those with approximate camera pose information. Moreover, our results improve with more views and outperform previous InstantNGP and Gaussian Splatting algorithms even when using half the dataset. Project page: https://raymondjiangkw.github.io/cogs.github.io/

6/12/2024

Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance

Taewon Kang, Divya Kothandaraman, Dinesh Manocha, Ming C. Lin

Recent 3D novel view synthesis (NVS) methods are limited to single-object-centric scenes generated from new viewpoints and struggle with complex environments. They often require extensive 3D data for training, lacking generalization beyond training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained stable diffusion model without tedious fine-tuning, but lack camera control. In this paper, we introduce HawkI++, a method capable of generating camera-controlled viewpoints from a single input image. HawkI++ excels in handling complex and diverse scenes without additional 3D data or extensive training. It leverages widely available pretrained NVS models for weak guidance, integrating this knowledge into a 3D-free view synthesis approach to achieve the desired results efficiently. Our experimental results demonstrate that HawkI++ outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity and consistent novel view synthesis at desired camera angles across a wide variety of scenes.

8/13/2024

Decoupling Dynamic Monocular Videos for Dynamic View Synthesis

Meng You, Junhui Hou

The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the dynamic objects of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flow and depth maps by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of the pre-processed supervision and the ambiguity when lifting the 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, respectively regularized by proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearances to be consistent across different viewpoints. Such a fine-grained motion formulation can alleviate the learning difficulty for the network, thus enabling it to produce not only novel views with higher quality but also more accurate scene flows and depth than existing methods requiring extra supervision.

8/22/2024