MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos

2406.00434

Published 6/4/2024 by Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lv, Peng Wang, Wenping Wang, Junhui Hou

Abstract

In this paper, we propose MoDGS, a new pipeline to render novel-view images in dynamic scenes using only casually captured monocular videos. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but fail to reconstruct dynamic scenes on casually captured input videos whose cameras are static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which outperforms baseline methods by a significant margin.

Overview

This paper introduces MoDGS, a novel method for dynamic Gaussian splatting from casually captured monocular videos.
MoDGS aims to reconstruct 4D dynamic scenes from a single RGB video, enabling novel view synthesis.
The method fits a 3D Gaussian mixture model of the scene to the input video, guided by single-view depth estimates; the result can then be rendered from any viewpoint.

Plain English Explanation

MoDGS: Dynamic Gaussian Splatting from Casually-captured Monocular Videos presents a new way to create 3D models from everyday video footage captured with a single camera. The key idea is to represent the 3D world as a collection of Gaussian "blobs" that can move and change over time.

By analyzing the video, the system can figure out the position, size, and motion of these Gaussian blobs, allowing it to reconstruct a 3D model of the scene. This 3D model can then be used to generate new views of the scene from different angles, enabling a kind of "virtual reality" experience from the original video.

The main advantage of this approach is that it only requires a standard video camera, rather than specialized 3D scanning hardware. This makes it much more accessible and practical for everyday use cases, like creating 3D models of family events or interesting locations. The researchers demonstrate the technique on a variety of scenes, showing how it can faithfully capture dynamic elements like moving people and objects.

Technical Explanation

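MoDGS takes a monocular video as input and produces a 3D Gaussian mixture model (GMM) representation of the scene. This GMM encodes the position, size, and motion of Gaussian "blobs" that collectively describe the 3D structure and dynamics of the captured environment. Per the abstract, three ingredients drive the learning: single-view depth estimates that guide the reconstruction, a 3D-aware initialization for the deformation field that moves the Gaussians over time, and a robust depth loss that supervises the dynamic scene geometry.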
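The representation described above can be illustrated with a small sketch. It is not the authors' implementation: the `Gaussians` container, the `toy_deformation` function, and the sinusoidal motion are all hypothetical stand-ins, chosen only to show how a set of Gaussians in a canonical frame can be moved to a given time step by a deformation function.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Gaussians:
    """Hypothetical container for a set of 3D Gaussian "blobs"."""
    means: np.ndarray      # (N, 3) centers in a canonical (time-independent) frame
    scales: np.ndarray     # (N, 3) per-axis standard deviations ("size")
    opacities: np.ndarray  # (N,) values in [0, 1]


def toy_deformation(means: np.ndarray, t: float) -> np.ndarray:
    """Return a per-Gaussian 3D offset at time t.

    Assumption: a hand-written stand-in for a learned deformation field.
    Points with x > 0 swing along the y axis; all other points stay still.
    """
    offsets = np.zeros_like(means)
    moving = means[:, 0] > 0.0
    offsets[moving, 1] = 0.5 * np.sin(2.0 * np.pi * t)
    return offsets


def gaussians_at_time(g: Gaussians, t: float) -> Gaussians:
    """Move the canonical Gaussians to time t; size and opacity are kept fixed."""
    return Gaussians(g.means + toy_deformation(g.means, t), g.scales, g.opacities)


canonical = Gaussians(
    means=np.array([[-1.0, 0.0, 2.0], [1.0, 0.0, 2.0]]),
    scales=np.full((2, 3), 0.1),
    opacities=np.array([0.9, 0.9]),
)
moved = gaussians_at_time(canonical, t=0.25)
print(moved.means)  # the second Gaussian is shifted by 0.5 along y
```

Keeping the Gaussians in one canonical frame and expressing motion as a separate function of time is one common way to separate geometry from dynamics; a renderer would then splat `moved` for the requested time and viewpoint.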

The method builds on prior work on self-calibrating 4D novel view synthesis and 3D geometry-aware deformable Gaussian splatting. It is also related to techniques for dynamic Gaussians with consistent mesh reconstruction, and for monocular sparse tracking with Gaussian mapping.

The key innovation of MoDGS is its ability to faithfully capture the 4D (3D + time) dynamics of a scene from a single monocular video, enabling high-quality novel view synthesis. The authors demonstrate the effectiveness of their approach through extensive experiments on a variety of dynamic scenes, showing compelling results in terms of reconstruction accuracy and rendering quality.
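The abstract states that single-view depth estimates guide the learning through a robust depth loss, but this summary does not spell out its form. As a hedged illustration only, the sketch below shows one widely used way to compare a rendered depth map with a single-view estimate when the estimate is known only up to an unknown scale and shift, which is the case for many single-view depth estimators. The function names `align_scale_shift` and `scale_shift_invariant_l1` are hypothetical, and this is an assumed stand-in, not the loss proposed in the paper.

```python
import numpy as np


def align_scale_shift(pred, target, mask=None):
    """Least-squares scale s and shift b minimizing ||s * pred + b - target||^2."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, target[mask], rcond=None)
    return float(s), float(b)


def scale_shift_invariant_l1(rendered, estimated, mask=None):
    """Mean absolute error after aligning the estimate to the rendered depth."""
    rendered = np.asarray(rendered, dtype=np.float64).ravel()
    estimated = np.asarray(estimated, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(rendered.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    s, b = align_scale_shift(estimated, rendered, mask)
    return float(np.mean(np.abs(s * estimated[mask] + b - rendered[mask])))


# Toy data: the rendered depth is an exact affine function of the estimate,
# so the aligned error should vanish up to floating-point precision.
estimated = np.array([[1.0, 2.0], [3.0, 4.0]])
rendered_affine = 2.0 * estimated + 3.0
loss_affine = scale_shift_invariant_l1(rendered_affine, estimated)

# A non-affine relationship cannot be explained by scale and shift alone.
rendered_other = estimated ** 2
loss_other = scale_shift_invariant_l1(rendered_other, estimated)
```

The design choice here is to solve for scale and shift in closed form before measuring the error, so that the loss penalizes only disagreements in relative depth structure rather than the arbitrary units of the depth estimate.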

Critical Analysis

One potential limitation of the MoDGS approach is that it may struggle with highly complex or cluttered scenes, where the Gaussian mixture model representation might not be able to capture all the nuances of the 3D geometry and motion. The authors acknowledge this and suggest that incorporating additional cues or more advanced neural network architectures could help address this issue.

Another area for further research could be exploring ways to make the system more robust to variations in camera motion, lighting conditions, and other real-world factors that can affect the quality of the input video. Integrating techniques for improved camera calibration or handling of occlusions and specularities may be valuable directions to pursue.

Overall, the MoDGS method represents an exciting advancement in the field of novel view synthesis, offering a practical and accessible solution for 4D scene reconstruction from everyday video footage. With continued refinement and development, this type of technology could have far-reaching applications in areas like virtual reality, augmented reality, and 3D content creation.

Conclusion

The MoDGS paper introduces a novel method for dynamic 3D scene reconstruction and novel view synthesis from monocular video inputs. By representing the 3D world as a Gaussian mixture model, the system can faithfully capture the 4D (3D + time) dynamics of a scene and generate compelling new perspectives from the original video.

This approach has significant potential for a wide range of applications, from virtual reality experiences to 3D content creation and beyond. While the current system has some limitations, the authors have laid the groundwork for an exciting new direction in the field of 3D computer vision, paving the way for further advancements and real-world impact.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas

Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional editability. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian marbles, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on-par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

6/28/2024

cs.CV

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, Kostas Daniilidis

We introduce 4D Motion Scaffolds (MoSca), a neural information processing system designed to reconstruct and synthesize novel views of dynamic scenes from monocular videos captured casually in the wild. To address such a challenging and ill-posed inverse problem, we leverage prior knowledge from foundational vision models, lift the video data to a novel Motion Scaffold (MoSca) representation, which compactly and smoothly encodes the underlying motions / deformations. The scene geometry and appearance are then disentangled from the deformation field, and are encoded by globally fusing the Gaussians anchored onto the MoSca and optimized via Gaussian Splatting. Additionally, camera poses can be seamlessly initialized and refined during the dynamic rendering process, without the need for other pose estimation tools. Experiments demonstrate state-of-the-art performance on dynamic rendering benchmarks.

5/28/2024

cs.CV cs.GR

3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, Yuchao Dai

In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance fields (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new representation of the 3D scene, building upon which the 3D geometry could be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/

4/16/2024

cs.CV

Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

Fang Li, Hao Zhang, Narendra Ahuja

Gaussian Splatting (GS) has significantly elevated scene reconstruction efficiency and novel view synthesis (NVS) accuracy compared to Neural Radiance Fields (NeRF), particularly for dynamic scenes. However, current 4D NVS methods, whether based on GS or NeRF, primarily rely on camera parameters provided by COLMAP and even utilize sparse point clouds generated by COLMAP for initialization, which lack accuracy as well as being time-consuming. This sometimes results in poor dynamic scene representation, especially in scenes with large object movements, or extreme camera conditions, e.g. small translations combined with large rotations. Some studies simultaneously optimize the estimation of camera parameters and scenes, supervised by additional information like depth, optical flow, etc. obtained from off-the-shelf models. Using this unverified information as ground truth can reduce robustness and accuracy, which does frequently occur for long monocular videos (with e.g. > hundreds of frames). We propose a novel approach that learns a high-fidelity 4D GS scene representation with self-calibration of camera parameters. It includes the extraction of 2D point features that robustly represent 3D structure, and their use for subsequent joint optimization of camera parameters and 3D structure towards overall 4D scene optimization. We demonstrate the accuracy and time efficiency of our method through extensive quantitative and qualitative experimental results on several standard benchmarks. The results show significant improvements over state-of-the-art methods for 4D novel view synthesis. The source code will be released soon at https://github.com/fangli333/SC-4DGS.

6/4/2024

cs.CV