360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

Read original: arXiv:2401.06578 - Published 5/13/2024 by Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, Jian Zhang

Overview

This paper presents a novel approach called 360DVD for generating controllable panorama videos using a 360-degree video diffusion model.
The model is trained on WEB360, a new dataset of captioned panoramic videos introduced in the paper, and generates diverse, coherent panoramic videos conditioned on text prompts and motion conditions.
The authors introduce several techniques to improve the quality and controllability of the generated videos, including a multi-scale architecture, temporal consistency modeling, and a novel decoding strategy.
Experiments show that 360DVD outperforms existing methods for text-to-360-degree video generation in terms of both visual quality and alignment with the input text.

Plain English Explanation

The 360DVD paper describes a new way to create 360-degree panoramic videos based on text descriptions. The authors have developed an AI model that can take a text prompt, like "a serene mountain landscape with a flowing river," and generate a corresponding 360-degree video that captures that scene.

This goes beyond text-driven panorama methods that produce only static 360-degree images, and beyond standard text-to-video models, which the authors report struggle with the content and motion patterns of panoramic video. By generating videos, the 360DVD model can create more dynamic and immersive panoramic content. The model is also more "controllable": besides the text prompt, users can supply motion conditions to guide the video generation.

At its core, 360DVD is a "diffusion model," a type of AI that learns to generate new content by studying a large dataset of example videos. Rather than training one from scratch, the authors adapt an existing text-to-video diffusion model with a lightweight add-on they call the 360-Adapter. The model can then use that learned knowledge to create novel panoramic videos based on the text prompts.

To make the generated videos high-quality and coherent, the authors developed several techniques, like modeling the temporal consistency between video frames and using a multi-scale architecture to capture details at different scales. These innovations allow 360DVD to outperform previous methods for text-to-360-degree video generation.

Overall, the 360DVD model makes engaging 360-degree video content easier to create, with potential applications in virtual reality and entertainment. By connecting text prompts to immersive 360-degree video, this research is a useful step forward in AI-generated multimedia.

Technical Explanation

The 360DVD paper introduces an approach for generating controllable panorama videos using a 360-degree video diffusion model. The key idea is to take pre-trained text-to-video (T2V) diffusion models, which already work well on standard video, and adapt them to 360-degree panoramic video with a lightweight 360-Adapter and a set of 360 Enhancement Techniques.

The authors train their 360DVD model on WEB360, a dataset of panoramic video-text pairs that they introduce because no captioned panoramic video dataset was available. This lets the model learn the underlying structure and dynamics of panoramic content. To generate a new video, the model takes a text prompt, along with any motion conditions, as input and then iteratively refines a noisy 360-degree video representation until it matches the desired content.
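The iterative refinement described above can be illustrated with a toy sketch. This is not the authors' implementation: the noise schedule, the tensor shape, and every function name below are assumptions made for illustration, and the trained, text-conditioned network is replaced by a hypothetical "oracle" predictor that already knows the clean target. The only point is to show the mechanics of a deterministic (DDIM-style) denoising loop that starts from Gaussian noise and refines it step by step.

```python
import numpy as np

def make_alpha_bars(num_steps: int) -> np.ndarray:
    """Cumulative signal-retention schedule: index 0 is clean (1.0), index num_steps is almost pure noise."""
    betas = np.linspace(1e-4, 0.2, num_steps)  # assumed linear schedule
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def make_toy_panorama_video(frames: int = 4, height: int = 8, width: int = 16) -> np.ndarray:
    """Stand-in for a clean video: a pattern that wraps around horizontally, with width = 2 * height."""
    longitude = np.linspace(0.0, 2.0 * np.pi, width, endpoint=False)
    return np.tile(np.sin(longitude), (frames, height, 1))

def oracle_noise_predictor(x_t: np.ndarray, alpha_bar_t: float, target: np.ndarray) -> np.ndarray:
    """Hypothetical replacement for the trained network: recovers the noise using the known target."""
    return (x_t - np.sqrt(alpha_bar_t) * target) / np.sqrt(1.0 - alpha_bar_t)

def denoise(target: np.ndarray, num_steps: int = 50, seed: int = 0) -> np.ndarray:
    """Run a deterministic DDIM-style reverse process from pure noise down to step 0."""
    rng = np.random.default_rng(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = rng.standard_normal(target.shape)  # start from pure Gaussian noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise_predictor(x, a_t, target)
        # Estimate the clean sample, then re-noise it to the previous (less noisy) level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

target = make_toy_panorama_video()
result = denoise(target)
print("max abs error vs. target:", float(np.max(np.abs(result - target))))
```

In a real system the oracle would be a learned network conditioned on the text prompt, so the loop would converge to a sample consistent with the prompt rather than to a known target.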

Several techniques are introduced to improve the quality and controllability of the generated videos. First, the authors use a multi-scale architecture that captures details at different spatial resolutions, enabling the model to generate high-fidelity panoramas. Second, they incorporate temporal consistency modeling to ensure smooth transitions between video frames.
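One simple way to make "smooth transitions between video frames" concrete is to measure how much pixels change from one frame to the next. The sketch below is a hypothetical diagnostic, not a loss or metric taken from the paper; the function name and the (frames, height, width) array layout are assumptions.

```python
import numpy as np

def temporal_roughness(video: np.ndarray) -> float:
    """Mean absolute change between consecutive frames; lower values mean smoother motion.

    `video` is assumed to have shape (frames, height, width).
    """
    if video.shape[0] < 2:
        return 0.0  # a single frame has no transitions to measure
    frame_deltas = np.diff(video, axis=0)  # frame[i + 1] - frame[i]
    return float(np.mean(np.abs(frame_deltas)))

# A static clip has no frame-to-frame change at all.
static_clip = np.ones((5, 8, 16))
# A clip whose brightness rises by one unit per frame changes by exactly one unit per step.
ramp_clip = np.arange(3, dtype=float).reshape(3, 1, 1) * np.ones((3, 8, 16))

print(temporal_roughness(static_clip))
print(temporal_roughness(ramp_clip))
```

A score like this only captures flicker; it says nothing about whether the motion is plausible, so it would complement rather than replace the evaluation described in the paper.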

The authors also propose a novel decoding strategy that allows for more precise control over the generated content. By using a combination of text-conditioning and latent space manipulation, the model can generate videos that closely align with the input text prompt while still maintaining diversity and creativity.

Extensive experiments demonstrate the effectiveness of the 360DVD approach. Against the baselines evaluated in the paper, the 360DVD model is reported to achieve better visual quality, text-video alignment, and overall coherence. Related panorama generation work includes DreamScene360 and TwinDiffusion.

Critical Analysis

The 360DVD paper presents an impressive advancement in the field of text-to-video generation, particularly for the creation of immersive 360-degree content. The authors have successfully adapted the power of diffusion models to the challenging task of generating coherent and controllable panoramic videos.

One potential limitation mentioned in the paper is the computational complexity of the 360DVD model, which may hinder its real-time application in some scenarios. The authors suggest exploring more efficient architectures or decoding strategies to address this issue.

Additionally, while the 360DVD model demonstrates strong performance on a diverse dataset of 360-degree videos, it would be interesting to see how it handles more specialized or domain-specific video content. Further research could investigate the model's ability to generate panoramic videos for particular applications, and how it relates to work such as Diffusion² for dynamic 3D content generation or Direct Video for user-directed video generation.

Overall, the 360DVD paper represents a significant advancement in the field of AI-generated multimedia and opens up new possibilities for the creation of immersive, text-guided panoramic video content. As the authors note, this research could have far-reaching implications for virtual reality, entertainment, and other applications that require engaging and customizable 360-degree video experiences.

Conclusion

The 360DVD paper presents a novel approach for generating controllable panorama videos using a 360-degree video diffusion model. By adapting the power of diffusion models to the task of text-to-360-degree video generation, the authors have developed a model that can create high-quality and coherent panoramic videos based on text prompts.

The key innovations in 360DVD include a multi-scale architecture, temporal consistency modeling, and a novel decoding strategy that allows for precise control over the generated content. Experiments show that the 360DVD model outperforms previous methods for text-to-360-degree video generation, opening up new possibilities for the creation of immersive and customizable panoramic video experiences.

This research represents an important step forward in the field of AI-generated multimedia, bridging the gap between text and 360-degree video content. The potential applications of 360DVD span virtual reality, entertainment, and beyond, as the ability to generate engaging panoramic videos from text prompts could have a transformative impact on how we create and consume visual media.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

Qian Wang, Weiqi Li, Chong Mou, Xinhua Cheng, Jian Zhang

Panorama video recently attracts more interest in both study and application, courtesy of its immersive experience. Due to the expensive cost of capturing 360-degree panoramic videos, generating desirable panorama videos by prompts is urgently required. Lately, the emerging text-to-video (T2V) diffusion methods demonstrate notable effectiveness in standard video generation. However, due to the significant gap in content and motion patterns between panoramic and standard videos, these methods encounter challenges in yielding satisfactory 360-degree panoramic videos. In this paper, we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos based on the given prompts and motion conditions. Specifically, we introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques to transform pre-trained T2V models for panorama video generation. We further propose a new panorama dataset named WEB360 consisting of panoramic video-text pairs for training 360DVD, addressing the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panorama video generation. Our project page is at https://akaneqwq.github.io/360DVD/.

5/13/2024

Taming Stable Diffusion for Text to 360° Panorama Image Generation

Cheng Zhang, Qianyi Wu, Camilo Cruz Gambardella, Xiaoshui Huang, Dinh Phung, Wanli Ouyang, Jianfei Cai

Generative models, e.g., Stable Diffusion, have enabled the creation of photorealistic images from text prompts. Yet, the generation of 360-degree panorama images from text remains a challenge, particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper, we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. Code is available at https://chengzhag.github.io/publication/panfusion.

4/12/2024

360PanT: Training-Free Text-Driven 360-Degree Panorama-to-Panorama Translation

Hai Wang, Jing-Hao Xue

Preserving boundary continuity in the translation of 360-degree panoramas remains a significant challenge for existing text-driven image-to-image translation methods. These methods often produce visually jarring discontinuities at the translated panorama's boundaries, disrupting the immersive experience. To address this issue, we propose 360PanT, a training-free approach to text-based 360-degree panorama-to-panorama translation with boundary continuity. Our 360PanT achieves seamless translations through two key components: boundary continuity encoding and seamless tiling translation with spatial control. Firstly, the boundary continuity encoding embeds critical boundary continuity information of the input 360-degree panorama into the noisy latent representation by constructing an extended input image. Secondly, leveraging this embedded noisy latent representation and guided by a target prompt, the seamless tiling translation with spatial control enables the generation of a translated image with identical left and right halves while adhering to the extended input's structure and semantic layout. This process ensures a final translated 360-degree panorama with seamless boundary continuity. Experimental results on both real-world and synthesized datasets demonstrate the effectiveness of our 360PanT in translating 360-degree panoramas. Code is available at https://github.com/littlewhitesea/360PanT.

9/16/2024

4K4DGen: Panoramic 4D Generation at 4K Resolution

Renjie Li, Panwang Pan, Bangbang Yang, Dejia Xu, Shijie Zhou, Xuanyang Zhang, Zeming Li, Achuta Kadambi, Zhangyang Wang, Zhiwen Fan

The blooming of virtual reality and augmented reality (VR/AR) technologies has driven an increasing demand for the creation of high-quality, immersive, and dynamic environments. However, existing generative techniques either focus solely on dynamic objects or perform outpainting from a single perspective image, failing to meet the needs of VR/AR applications. In this work, we tackle the challenging task of elevating a single panorama to an immersive 4D experience. For the first time, we demonstrate the capability to generate omnidirectional dynamic scenes with 360-degree views at 4K resolution, thereby providing an immersive user experience. Our method introduces a pipeline that facilitates natural scene animations and optimizes a set of 4D Gaussians using efficient splatting techniques for real-time exploration. To overcome the lack of scene-scale annotated 4D data and models, especially in panoramic formats, we propose a novel Panoramic Denoiser that adapts generic 2D diffusion priors to animate consistently in 360-degree images, transforming them into panoramic videos with dynamic scenes at targeted regions. Subsequently, we elevate the panoramic video into a 4D immersive environment while preserving spatial and temporal consistency. By transferring prior knowledge from 2D models in the perspective domain to the panoramic domain and the 4D lifting with spatial appearance and geometry regularization, we achieve high-quality Panorama-to-4D generation at a resolution of 4096 × 2048 for the first time. See the project website at https://4k4dgen.github.io.

7/8/2024