Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

2406.18717

Published 6/28/2024 by Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, Leonidas Guibas

cs.CV

Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos

Abstract

Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional edibility. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian marbles, reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on-par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.

Create account to get full access

Overview

This paper proposes a novel method for novel view synthesis of casual monocular videos using "Dynamic Gaussian Marbles".
The method represents 3D scene geometry using a collection of dynamic Gaussian splatts, which can be efficiently rendered and edited.
The system is able to synthesize novel views from a single input video, enabling applications like video editing and virtual camera control.

Plain English Explanation

The paper presents a new way to create 3D models from regular videos taken with a single camera. Instead of trying to build a detailed 3D mesh, the approach uses a collection of "Gaussian marbles" - simple 3D shapes that can represent the geometry of the scene.

These Gaussian marbles have a few key advantages:

Dynamic Gaussians Mesh Consistent: They can change shape and position over time to match the movement in the video, creating a dynamic 3D model.
MODGS: Dynamic Gaussian Splatting: They can be efficiently rendered, allowing the system to generate new views of the scene from different angles.
3D Geometry Aware Deformable Gaussian Splatting: The Gaussian marbles can be easily edited, enabling applications like video editing and virtual camera control.

Overall, this novel approach allows 3D content to be created from regular videos in a way that is both efficient and editable, opening up new possibilities for video manipulation and virtual cinematography.

Technical Explanation

The key technical innovation in this paper is the use of "Dynamic Gaussian Marbles" to represent the 3D geometry of a scene captured in a monocular video. Each Gaussian marble is a simple 3D shape that can change its position, size, and orientation over time to match the movement in the input video.

The system first extracts a set of Gaussian marbles from the video using a novel Sparse Controlled Gaussian Splatting algorithm. This allows the 3D geometry to be represented in a compact and efficient way.

Next, the Dynamic 3D Gaussians Distillation module is used to refine the Gaussian marbles, ensuring that they accurately capture the 3D structure of the scene. This enables the system to synthesize novel views of the scene from different camera angles.

Critical Analysis

The proposed method is a clever and efficient approach to the challenging problem of novel view synthesis from monocular video. By representing the 3D scene geometry using a collection of dynamic Gaussian marbles, the system is able to overcome some of the limitations of traditional 3D reconstruction techniques.

However, the paper does acknowledge some potential limitations of the approach. For example, the Gaussian marbles may struggle to capture fine details or complex geometric structures, which could limit the visual fidelity of the synthesized novel views. Additionally, the system's reliance on a single input video means that it may not be able to handle scenes with significant occlusions or dramatic changes in lighting.

Further research could explore ways to address these limitations, such as by integrating the Gaussian marble representation with other 3D modeling techniques or by leveraging additional sensor data (e.g., depth sensors, multiple cameras) to improve the quality and robustness of the novel view synthesis.

Overall, the "Dynamic Gaussian Marbles" approach represents an interesting and promising step forward in the field of 3D scene reconstruction and novel view synthesis from monocular video.

Conclusion

This paper introduces a novel method for synthesizing novel views of a scene from a single input monocular video. By representing the 3D scene geometry using a collection of dynamic Gaussian marbles, the system is able to efficiently capture the movement and structure of the scene, enabling applications like video editing and virtual camera control.

The key technical contributions include the Sparse Controlled Gaussian Splatting algorithm for extracting the Gaussian marbles from the video, and the Dynamic 3D Gaussians Distillation module for refining the 3D representation.

The authors demonstrate the effectiveness of their approach through extensive experiments, showing that it outperforms state-of-the-art methods for novel view synthesis on a range of casual monocular video datasets. While the method has some limitations, it represents an exciting step forward in the field of 3D scene reconstruction and opens up new possibilities for video manipulation and virtual cinematography.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

🤷

Dynamic Gaussians Mesh: Consistent Mesh Reconstruction from Monocular Videos

Isabella Liu, Hao Su, Xiaolong Wang

Modern 3D engines and graphics pipelines require mesh as a memory-efficient representation, which allows efficient rendering, geometry processing, texture editing, and many other downstream operations. However, it is still highly difficult to obtain high-quality mesh in terms of structure and detail from monocular visual observations. The problem becomes even more challenging for dynamic scenes and objects. To this end, we introduce Dynamic Gaussians Mesh (DG-Mesh), a framework to reconstruct a high-fidelity and time-consistent mesh given a single monocular video. Our work leverages the recent advancement in 3D Gaussian Splatting to construct the mesh sequence with temporal consistency from a video. Building on top of this representation, DG-Mesh recovers high-quality meshes from the Gaussian points and can track the mesh vertices over time, which enables applications such as texture editing on dynamic objects. We introduce the Gaussian-Mesh Anchoring, which encourages evenly distributed Gaussians, resulting better mesh reconstruction through mesh-guided densification and pruning on the deformed Gaussians. By applying cycle-consistent deformation between the canonical and the deformed space, we can project the anchored Gaussian back to the canonical space and optimize Gaussians across all time frames. During the evaluation on different datasets, DG-Mesh provides significantly better mesh reconstruction and rendering than baselines. Project page: https://www.liuisabella.com/DG-Mesh/

4/23/2024

cs.CV

MoDGS: Dynamic Gaussian Splatting from Causually-captured Monocular Videos

Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lv, Peng Wang, Wenping Wang, Junhui Hou

In this paper, we propose MoDGS, a new pipeline to render novel-view images in dynamic scenes using only casually captured monocular videos. Previous monocular dynamic NeRF or Gaussian Splatting methods strongly rely on the rapid movement of input cameras to construct multiview consistency but fail to reconstruct dynamic scenes on casually captured input videos whose cameras are static or move slowly. To address this challenging task, MoDGS adopts recent single-view depth estimation methods to guide the learning of the dynamic scene. Then, a novel 3D-aware initialization method is proposed to learn a reasonable deformation field and a new robust depth loss is proposed to guide the learning of dynamic scene geometry. Comprehensive experiments demonstrate that MoDGS is able to render high-quality novel view images of dynamic scenes from just a casually captured monocular video, which outperforms baseline methods by a significant margin.

6/4/2024

cs.CV

3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Min Yang, Xiao Tang, Feng Zhu, Yuchao Dai

In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance fields (NeRF) based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting provides a new representation of the 3D scene, building upon which the 3D geometry could be exploited in learning the complex 3D deformation. Specifically, the scenes are represented as a collection of 3D Gaussian, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them in learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/

4/16/2024

cs.CV

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, Xiaojuan Qi

Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics. Recently, Gaussian splatting has emerged as a robust technique to represent static scenes and enable high-quality and real-time novel view synthesis. Building upon this technique, we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians, respectively. Our key idea is to use sparse control points, significantly fewer in number than the Gaussians, to learn compact 6 DoF transformation bases, which can be locally interpolated through learned interpolation weights to yield the motion field of 3D Gaussians. We employ a deformation MLP to predict time-varying 6 DoF transformations for each control point, which reduces learning complexities, enhances learning abilities, and facilitates obtaining temporal and spatial coherent motion patterns. Then, we jointly learn the 3D Gaussians, the canonical space locations of control points, and the deformation MLP to reconstruct the appearance, geometry, and dynamics of 3D scenes. During learning, the location and number of control points are adaptively adjusted to accommodate varying motion complexities in different regions, and an ARAP loss following the principle of as rigid as possible is developed to enforce spatial continuity and local rigidity of learned motions. Finally, thanks to the explicit sparse motion representation and its decomposition from appearance, our method can enable user-controlled motion editing while retaining high-fidelity appearances. Extensive experiments demonstrate that our approach outperforms existing approaches on novel view synthesis with a high rendering speed and enables novel appearance-preserved motion editing applications. Project page: https://yihua7.github.io/SC-GS-web/

4/15/2024

cs.CV cs.GR