DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

2405.02280

Published 5/24/2024 by Wen-Hsuan Chu, Lei Ke, Katerina Fragkiadaki

Abstract

View-predictive generative models provide strong priors for lifting object-centric images and videos into 3D and 4D through rendering and score distillation objectives. A question then remains: what about lifting complete multi-object dynamic scenes? There are two challenges in this direction: First, rendering error gradients are often insufficient to recover fast object motion, and second, view predictive generative models work much better for objects than whole scenes, so, score distillation objectives cannot currently be applied at the scene level directly. We present DreamScene4D, the first approach to generate 3D dynamic scenes of multiple objects from monocular videos via 360-degree novel view synthesis. Our key insight is a decompose-recompose approach that factorizes the video scene into the background and object tracks, while also factorizing object motion into 3 components: object-centric deformation, object-to-world-frame transformation, and camera motion. Such decomposition permits rendering error gradients and object view-predictive models to recover object 3D completions and deformations while bounding box tracks guide the large object movements in the scene. We show extensive results on challenging DAVIS, Kubric, and self-captured videos with quantitative comparisons and a user preference study. Besides 4D scene generation, DreamScene4D obtains accurate 2D persistent point track by projecting the inferred 3D trajectories to 2D. We will release our code and hope our work will stimulate more research on fine-grained 4D understanding from videos.

Overview

This paper presents a novel approach called DreamScene4D that can generate 3D dynamic scenes from monocular in-the-wild videos with complex object motion and occlusions.
It builds upon progress in video object tracking and generative models to tackle the challenging task of 2D-to-3D object lifting.
The key insight is to decompose the video scene into objects and background, model their 3D motion separately, and then recompose the final 4D scene.

Plain English Explanation

DreamScene4D is a system that can take regular 2D videos and turn them into 3D dynamic scenes. This is a challenging problem because videos can have objects moving around and getting blocked by other things.

The researchers solved this in stages. First, they split the video into its individual objects and the background, using tracking models to follow each object and a generative image model to fill in the parts hidden behind other things. Then, they worked out how each object moves in 3D space and how the camera moves. Finally, they put everything back together to create a full 3D scene that matches the original video.

This is an impressive technical feat that could have all sorts of applications, like creating 3D animations from real-world footage or enhancing video games and virtual reality experiences. By breaking down the problem into smaller steps, the researchers were able to tackle the challenges of 2D-to-3D object lifting in a new way.

Technical Explanation

The key innovation in DreamScene4D is its "decompose-then-recompose" approach to generating 3D dynamic scenes from monocular videos.

First, the system uses open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the input video. This gives it a decomposed representation of the scene.

Next, it maps each object track to a set of 3D Gaussians that can deform and move over time. To handle fast motion, the system factorizes the observed motion into three components: object-centric deformation, object-to-world-frame transformation, and camera motion. It infers the camera motion by re-rendering the background to match the video frames.
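To make the motion factorization concrete, here is a minimal sketch of how a single Gaussian centre could be carried through the three components and then projected to 2D, which is also how a 2D point track can be read off the 3D trajectory. Everything in it is an illustrative assumption rather than the authors' code: the function names, the similarity-transform parameterization of the object pose, and the pinhole intrinsics (focal, cx, cy) are all hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def object_to_world(p_obj, scale, rot, trans):
    """Similarity transform from the object-centric frame to the world frame."""
    r = mat_vec(rot, [scale * x for x in p_obj])
    return [r[i] + trans[i] for i in range(3)]

def world_to_camera(p_world, rot_cam, trans_cam):
    """Rigid world-to-camera transform (assumed camera extrinsics)."""
    r = mat_vec(rot_cam, p_world)
    return [r[i] + trans_cam[i] for i in range(3)]

def project(p_cam, focal, cx, cy):
    """Pinhole projection; assumes the point lies in front of the camera (z > 0)."""
    x, y, z = p_cam
    return (focal * x / z + cx, focal * y / z + cy)

def track_point(p_canonical, deformations, obj_poses, cam_poses,
                focal=500.0, cx=320.0, cy=240.0):
    """Return the 2D track of one Gaussian centre, one pixel position per frame."""
    track = []
    for d, (s, r_obj, t_obj), (r_cam, t_cam) in zip(deformations, obj_poses, cam_poses):
        p_obj = [p_canonical[i] + d[i] for i in range(3)]   # 1. object-centric deformation
        p_world = object_to_world(p_obj, s, r_obj, t_obj)   # 2. object-to-world transform
        p_cam = world_to_camera(p_world, r_cam, t_cam)      # 3. camera motion
        track.append(project(p_cam, focal, cx, cy))
    return track

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Two frames: the object moves 1 unit along x while the camera moves with it.
deformations = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
obj_poses = [(1.0, IDENTITY, [0.0, 0.0, 5.0]), (1.0, IDENTITY, [1.0, 0.0, 5.0])]
cam_poses = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [-1.0, 0.0, 0.0])]
track = track_point([0.0, 0.0, 0.0], deformations, obj_poses, cam_poses)
```

In this toy example the object translation and the camera motion cancel, so the only pixel motion left in the second frame comes from the small object-centric deformation. That is the point of the factorization: large displacements are absorbed by the pose terms, leaving the deformation term to explain small residual motion.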

For the object motion, the system first models the object-centric deformation using rendering losses and multi-view generative priors in an object-centric frame. It then optimizes the object-centric to world-frame transformations by comparing the rendered output against the video.
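The abstract notes that bounding box tracks guide the large object movements in the scene. One plausible way to seed the object-to-world transformation from a 2D box before any rendering-based refinement is sketched below. This is an assumption for illustration only: the function name, the identity-camera simplification (camera frame equals world frame), the pinhole intrinsics, and the use of a known depth value are all hypothetical and not taken from the paper.

```python
def bbox_to_pose(bbox, depth, canonical_width, focal=500.0, cx=320.0, cy=240.0):
    """Seed a scale and translation for one frame from a 2D bounding box.

    bbox            -- (x_min, y_min, x_max, y_max) in pixels
    depth           -- assumed distance of the object centre along the camera z axis
    canonical_width -- width of the object in its own object-centric units
    """
    x_min, y_min, x_max, y_max = bbox
    u = 0.5 * (x_min + x_max)   # box centre in pixels
    v = 0.5 * (y_min + y_max)

    # Unproject the box centre to 3D at the assumed depth.
    tx = (u - cx) * depth / focal
    ty = (v - cy) * depth / focal

    # Width the box covers at that depth, relative to the canonical width.
    width_at_depth = (x_max - x_min) * depth / focal
    scale = width_at_depth / canonical_width
    return scale, (tx, ty, depth)

# A box centred on the principal point, then the same box shifted 100 px right.
scale_a, trans_a = bbox_to_pose((270.0, 190.0, 370.0, 290.0), depth=5.0, canonical_width=2.0)
scale_b, trans_b = bbox_to_pose((370.0, 190.0, 470.0, 290.0), depth=5.0, canonical_width=2.0)
```

A seed like this only gives a coarse pose per frame. In the pipeline described above, the transformation is then optimized by comparing the rendered output against the video.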

Finally, the system recomposes the background and objects, optimizing for relative object scales using monocular depth prediction guidance.
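As a rough illustration of depth-guided scale fitting, the sketch below finds the single scale factor that best aligns an object's rendered depths with predicted monocular depths at the same pixels, in the least-squares sense. It is a simplified stand-in, not the paper's objective: the function name is hypothetical, and real monocular depth predictions are often only defined up to scale and shift, which this version ignores.

```python
def fit_depth_scale(rendered, predicted):
    """Closed-form least-squares scale s minimizing sum((s * rendered - predicted)^2)."""
    if len(rendered) != len(predicted):
        raise ValueError("depth lists must have the same length")
    num = sum(r * p for r, p in zip(rendered, predicted))
    den = sum(r * r for r in rendered)
    if den == 0.0:
        raise ValueError("rendered depths are all zero")
    return num / den

# Depths sampled at the same pixels inside one object's mask (made-up values).
rendered_depth = [1.0, 2.0, 4.0]
predicted_depth = [2.5, 5.0, 10.0]
scale = fit_depth_scale(rendered_depth, predicted_depth)
```

Fitting one scale per object in this way places independently reconstructed objects at mutually consistent sizes before they are composed with the background.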

This approach enables DreamScene4D to generate high-quality 3D dynamic scenes from challenging in-the-wild videos, as demonstrated on the DAVIS, Kubric, and self-captured datasets.

Critical Analysis

The paper highlights some impressive capabilities of DreamScene4D, but also acknowledges several limitations and areas for future work.

One key limitation is that the system relies on accurate object segmentation and tracking, which can be challenging in complex real-world videos. The authors suggest exploring "GUESS: Unseen Dynamic 3D Scene Reconstruction from Monocular Video" and "MVDream: Multi-View Diffusion for 3D Generation" to address this.

Additionally, the 3D motion modeling could be further improved, especially for fast-moving objects. The authors mention plans to leverage "WALT3D: Generating Realistic Training Data from Time" to generate more diverse training data.

Overall, DreamScene4D represents an exciting step forward in the challenging problem of 2D-to-3D object lifting. While not perfect, the research highlights the potential of decomposition-based approaches to tackle complex 3D scene generation from monocular videos.

Conclusion

The DreamScene4D paper presents a novel approach for generating 3D dynamic scenes from monocular in-the-wild videos. By carefully decomposing the video into objects and background, modeling their 3D motion separately, and then recomposing the final scene, the system can handle complex object motion and occlusions.

This work builds on progress in video object tracking and generative models, demonstrating the power of a divide-and-conquer strategy for tackling the challenging task of 2D-to-3D object lifting. While the system has some limitations, it represents an important step forward in the field of 3D scene generation from 2D inputs.

The potential applications of this technology are wide-ranging, from creating 3D animations from real-world footage to enhancing video games and virtual reality experiences. As the researchers continue to refine and expand their approach, we can expect to see even more impressive results in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Decoupling Dynamic Monocular Videos for Dynamic View Synthesis

Meng You, Junhui Hou

The challenge of dynamic view synthesis from dynamic monocular videos, i.e., synthesizing novel views for free viewpoints given a monocular video of a dynamic scene captured by a moving camera, mainly lies in accurately modeling the dynamic objects of a scene using limited 2D frames, each with a varying timestamp and viewpoint. Existing methods usually require pre-processed 2D optical flow and depth maps by off-the-shelf methods to supervise the network, making them suffer from the inaccuracy of the pre-processed supervision and the ambiguity when lifting the 2D information to 3D. In this paper, we tackle this challenge in an unsupervised fashion. Specifically, we decouple the motion of the dynamic objects into object motion and camera motion, respectively regularized by proposed unsupervised surface consistency and patch-based multi-view constraints. The former enforces the 3D geometric surfaces of moving objects to be consistent over time, while the latter regularizes their appearances to be consistent across different viewpoints. Such a fine-grained motion formulation can alleviate the learning difficulty for the network, thus enabling it to produce not only novel views with higher quality but also more accurate scene flows and depth than existing methods requiring extra supervision.

6/3/2024

cs.CV

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee

Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.

6/12/2024

cs.CV

A Unified Approach for Text- and Image-guided 4D Scene Generation

Yufeng Zheng, Xueting Li, Koki Nagano, Sifei Liu, Karsten Kreis, Otmar Hilliges, Shalini De Mello

Large-scale diffusion generative models are greatly simplifying image, video and 3D asset creation from user-provided text prompts and images. However, the challenging problem of text-to-4D (dynamic 3D) scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study, we demonstrate that our approach significantly advances image and motion quality, 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation, Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images, without the need to modify the motion learning stage. Thus, our method offers, for the first time, a unified approach for text-to-4D, image-to-4D and personalized 4D generation tasks.

5/8/2024

cs.CV

Generative Camera Dolly: Extreme Monocular Dynamic Novel View Synthesis

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, Carl Vondrick

Accurate reconstruction of complex dynamic scenes from just a single viewpoint continues to be a challenging task in computer vision. Current dynamic novel view synthesis methods typically require videos from many different camera viewpoints, necessitating careful recording setups, and significantly restricting their utility in the wild as well as in terms of embodied AI applications. In this paper, we propose GCD, a controllable monocular dynamic view synthesis pipeline that leverages large-scale diffusion priors to, given a video of any scene, generate a synchronous video from any other chosen perspective, conditioned on a set of relative camera pose parameters. Our model does not require depth as input, and does not explicitly model 3D scene geometry, instead performing end-to-end video-to-video translation in order to achieve its goal efficiently. Despite being trained on synthetic multi-view video data only, zero-shot real-world generalization experiments show promising results in multiple domains, including robotics, object permanence, and driving environments. We believe our framework can potentially unlock powerful applications in rich dynamic scene understanding, perception for robotics, and interactive 3D video viewing experiences for virtual reality.

5/24/2024

cs.CV cs.AI cs.LG cs.RO