Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

2406.01042

Published 6/4/2024 by Fang Li, Hao Zhang, Narendra Ahuja


Abstract

Gaussian Splatting (GS) has significantly improved scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP, and some even use sparse point clouds generated by COLMAP for initialization; these inputs lack accuracy and are time-consuming to obtain. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements or extreme camera conditions, e.g., small translations combined with large rotations. Some studies jointly optimize camera parameters and the scene, supervised by additional information such as depth and optical flow obtained from off-the-shelf models. Treating this unverified information as ground truth can reduce robustness and accuracy, which frequently happens for long monocular videos (e.g., hundreds of frames or more). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use for subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS.


Overview

This paper presents a novel method for 4D novel view synthesis from monocular videos using Gaussian splatting.
The approach is self-calibrating, meaning it estimates the camera parameters from the input video itself, without relying on COLMAP or a separate calibration step.
The method represents the dynamic scene with 3D Gaussians (Gaussian splatting), enabling efficient and high-quality novel view synthesis.

Plain English Explanation

The paper describes a technique for rendering new, realistic-looking views of a moving scene from a single ordinary video recording. This is known as "novel view synthesis": it lets you virtually move the camera around and see the scene from angles that weren't captured in the original video. The "4D" refers to the three spatial dimensions plus time, since the scene itself changes over the course of the video.

The work builds on a technique called "Gaussian splatting," which represents the scene as a large collection of small, semi-transparent 3D blobs (Gaussians) that can be drawn on screen very quickly. The key contribution is that the approach is "self-calibrating": it figures out the camera parameters (such as the camera's position and orientation in each frame) automatically from the input video, instead of depending on a separate tool such as COLMAP to compute them beforehand.

By combining the Gaussian splatting representation with the self-calibration capability, the method can produce high-quality novel views from monocular (single-camera) video inputs in an efficient and practical way. This could be useful for applications like virtual reality, video editing, and 3D content creation.

Technical Explanation

The paper introduces a method for 4D novel view synthesis from monocular videos. The scene is modeled with Gaussian splatting, which enables efficient, high-quality rendering; the main contribution is removing the dependence on COLMAP for camera parameters and point-cloud initialization, which the authors describe as inaccurate and time-consuming.
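Gaussian splatting renders an image by projecting each Gaussian to the screen and blending the resulting 2D splats front to back. The minimal sketch below shows that blending for a single pixel, C = sum_i c_i a_i prod_{j<i} (1 - a_j), following the general scheme of the original 3D Gaussian splatting renderer. It is an illustration, not code from this paper, and the function names and dictionary fields are hypothetical.

```python
import numpy as np


def gaussian_weight(pixel, mean2d, cov2d):
    """Unnormalized 2D Gaussian falloff exp(-0.5 * d^T Sigma^-1 d) at `pixel`."""
    d = np.asarray(pixel, dtype=float) - np.asarray(mean2d, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.solve(cov2d, d)))


def composite_pixel(pixel, splats, background=None):
    """Blend depth-sorted splats front to back into one RGB color.

    Each splat is a dict with hypothetical keys:
    'depth', 'mean2d', 'cov2d', 'opacity', 'color'.
    """
    background = np.zeros(3) if background is None else np.asarray(background, dtype=float)
    color = np.zeros(3)
    transmittance = 1.0  # fraction of light still passing through to farther splats
    for sp in sorted(splats, key=lambda item: item["depth"]):
        weight = gaussian_weight(pixel, sp["mean2d"], sp["cov2d"])
        alpha = min(0.99, sp["opacity"] * weight)  # clamp so no splat is fully opaque
        if alpha < 1.0 / 255.0:  # skip negligible contributions
            continue
        color += transmittance * alpha * np.asarray(sp["color"], dtype=float)
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # early exit once the pixel is effectively opaque
            break
    return color + transmittance * background
```

Each pixel only needs a depth-sorted list of the splats that overlap it, so this loop can run independently per pixel; this is one reason splatting renders much faster than ray-marching a NeRF.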

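The approach is self-calibrating: it estimates the camera parameters from the input video rather than from a separate calibration step. According to the abstract, the method first extracts 2D point features that robustly represent the 3D structure, then uses them to jointly optimize the camera parameters and the 3D structure, and builds on this to optimize the full 4D scene. This avoids treating the outputs of off-the-shelf depth or optical-flow models as ground truth, which the authors argue hurts robustness on long videos.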
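The joint optimization of camera parameters and 3D structure from 2D point features can be illustrated with a classic bundle-adjustment sketch: minimize the reprojection error of 2D point tracks over camera poses and 3D points at the same time. This is a stand-in under simplifying assumptions (a shared, known focal length, principal point at the origin, static points, no robust loss), not the paper's actual formulation; all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation


def project(points, rotvec, t, focal):
    """Pinhole projection of world points (N, 3) into a camera with pose (rotvec, t)."""
    cam = Rotation.from_rotvec(rotvec).apply(points) + t  # world -> camera coordinates
    return focal * cam[:, :2] / cam[:, 2:3]


def residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs, focal):
    """Reprojection residuals, one (du, dv) pair per 2D observation."""
    poses = params[: n_cams * 6].reshape(n_cams, 6)  # [rotvec (3), translation (3)] per camera
    points = params[n_cams * 6:].reshape(n_pts, 3)
    res = np.empty_like(obs)
    for c in range(n_cams):
        mask = cam_idx == c
        if not mask.any():
            continue
        res[mask] = project(points[pt_idx[mask]], poses[c, :3], poses[c, 3:], focal) - obs[mask]
    return res.ravel()


def joint_optimize(init_poses, init_points, cam_idx, pt_idx, obs, focal):
    """Jointly refine camera poses and 3D points by minimizing reprojection error."""
    x0 = np.concatenate([init_poses.ravel(), init_points.ravel()])
    sol = least_squares(
        residuals, x0, method="trf", x_scale="jac",
        args=(len(init_poses), len(init_points), cam_idx, pt_idx, obs, focal),
    )
    n = len(init_poses) * 6
    return sol.x[:n].reshape(-1, 6), sol.x[n:].reshape(-1, 3), sol
```

Gauge freedom (a global rotation, translation, and scale) is left unconstrained here; SciPy's trust-region solver tolerates this, but a practical system would typically fix the first camera and the scale.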

The Gaussian splatting representation is explicit and compact, and it can be rendered efficiently from new viewpoints. The method also refines the 3D Gaussian representation to further improve the quality of the novel views.
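For reference, the standard 3D Gaussian splatting parameterization stores each Gaussian's covariance as a rotation and per-axis scales, Sigma = R S S^T R^T, which keeps Sigma positive semi-definite during gradient-based optimization. It then projects the covariance to the screen with Sigma' = J W Sigma W^T J^T, where W is the world-to-camera rotation and J is the Jacobian of the perspective projection at the Gaussian's center. The sketch below assumes this standard parameterization; the summary does not say whether the paper modifies it, and the function names are hypothetical.

```python
import numpy as np


def quat_to_rotmat(q):
    """Rotation matrix from a (w, x, y, z) quaternion, normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])


def covariance_3d(quat, log_scale):
    """Sigma = R S S^T R^T, symmetric positive semi-definite by construction."""
    R = quat_to_rotmat(quat)
    S = np.diag(np.exp(log_scale))  # exp keeps the per-axis scales positive
    M = R @ S
    return M @ M.T


def project_covariance(cov3d, mean_world, R_wc, t_wc, fx, fy):
    """Screen-space covariance J W Sigma W^T J^T (local affine approximation)."""
    tx, ty, tz = R_wc @ mean_world + t_wc  # Gaussian center in camera coordinates
    J = np.array([
        [fx / tz, 0.0, -fx * tx / tz**2],
        [0.0, fy / tz, -fy * ty / tz**2],
    ])
    JW = J @ R_wc
    return JW @ cov3d @ JW.T
```

Optimizing a quaternion and log-scales instead of the raw matrix entries means any gradient step still yields a valid covariance, which is why this parameterization is the usual choice.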

Critical Analysis

The paper presents a compelling approach for 4D novel view synthesis that addresses several key challenges in the field. The self-calibrating nature of the method is a notable strength, as it reduces the burden of manual camera calibration.

However, the paper does not extensively discuss the limitations of the Gaussian splatting representation. Although efficient, the representation may not be flexible enough to capture some complex 3D geometries accurately. In addition, while the abstract reports significant improvements over state-of-the-art methods on several standard benchmarks, it is less clear how well those gains carry over to scenes outside these benchmarks.

Further research could explore ways to combine the Gaussian splatting approach with other techniques, such as neural radiance fields, to achieve even higher-fidelity novel view synthesis. Investigating the method's performance on a wider range of scenes and scenarios would also be valuable.

Conclusion

This paper presents a method for 4D novel view synthesis from monocular videos using a self-calibrating Gaussian splatting approach. Its key features are a compact, efficient Gaussian scene representation and the ability to estimate camera parameters directly from the input video, which the authors use to jointly optimize the cameras and the scene instead of relying on COLMAP.

The proposed technique could have significant implications for a range of applications, such as virtual reality, video editing, and 3D content creation, by enabling the generation of realistic and dynamic 3D environments from standard video inputs. The self-calibrating nature of the method also makes it more accessible and practical for real-world use cases.

While the paper demonstrates promising results, further research is needed to fully understand the limitations and explore ways to enhance the method's flexibility and performance. Overall, this work represents an important step forward in the field of 4D novel view synthesis.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

FSGS: Real-Time Few-shot View Synthesis using Gaussian Splatting

Zehao Zhu, Zhiwen Fan, Yifan Jiang, Zhangyang Wang

Novel view synthesis from limited observations remains an important and persistent task. However, high efficiency in existing NeRF-based few-shot view synthesis is often compromised to obtain an accurate 3D representation. To address this challenge, we propose a few-shot view synthesis framework based on 3D Gaussian Splatting that enables real-time and photo-realistic view synthesis with as few as three training views. The proposed method, dubbed FSGS, handles the extremely sparse initialized SfM points with a thoughtfully designed Gaussian Unpooling process. Our method iteratively distributes new Gaussians around the most representative locations, subsequently infilling local details in vacant areas. We also integrate a large-scale pre-trained monocular depth estimator within the Gaussians optimization process, leveraging online augmented views to guide the geometric optimization towards an optimal solution. Starting from sparse points observed from limited input viewpoints, our FSGS can accurately grow into unseen regions, comprehensively covering the scene and boosting the rendering quality of novel views. Overall, FSGS achieves state-of-the-art performance in both accuracy and rendering efficiency across diverse datasets, including LLFF, Mip-NeRF360, and Blender. Project website: https://zehaozhu.github.io/FSGS/.

6/18/2024

cs.CV

3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, Yuchao Dai

In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance fields (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting has provided a new representation of the 3D scene, building upon which the 3D geometry could be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/

4/16/2024

cs.CV

WE-GS: An In-the-wild Efficient 3D Gaussian Representation for Unconstrained Photo Collections

Yuze Wang, Junyi Wang, Yue Qi

Novel View Synthesis (NVS) from unconstrained photo collections is challenging in computer graphics. Recently, 3D Gaussian Splatting (3DGS) has shown promise for photorealistic and real-time NVS of static scenes. Building on 3DGS, we propose an efficient point-based differentiable rendering framework for scene reconstruction from photo collections. Our key innovation is a residual-based spherical harmonic coefficients transfer module that adapts 3DGS to varying lighting conditions and photometric post-processing. This lightweight module can be pre-computed and ensures efficient gradient propagation from rendered images to 3D Gaussian attributes. Additionally, we observe that the appearance encoder and the transient mask predictor, the two most critical parts of NVS from unconstrained photo collections, can be mutually beneficial. We introduce a plug-and-play lightweight spatial attention module to simultaneously predict transient occluders and latent appearance representation for each image. After training and preprocessing, our method aligns with the standard 3DGS format and rendering pipeline, facilitating seamless integration into various 3DGS applications. Extensive experiments on diverse datasets show our approach outperforms existing approaches on the rendering quality of novel view and appearance synthesis, with fast convergence and rendering speed.

6/5/2024

cs.CV

SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting

Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, Achuta Kadambi

The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few-shot settings, similar to NeRF, 3DGS tends to overfit to training views, causing background collapse and excessive floaters, especially as the number of training views is reduced. We propose a method to enable training coherent 3DGS-based radiance fields of 360-degree scenes from sparse training views. We integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by 6.4% in LPIPS and by 12.2% in PSNR, and NeRF-based methods by at least 17.6% in LPIPS on the MipNeRF-360 dataset with substantially less training and inference cost.

5/14/2024

cs.CV cs.LG eess.IV