Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Read original: arXiv:2407.13759 - Published 7/26/2024 by Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein

Overview

This paper presents a method for generating long, consistent street view sequences with an autoregressive video diffusion model.
Generation is conditioned on language input (e.g., city name, weather) and on an underlying map/layout that hosts the desired camera trajectory.
The approach builds on recent video diffusion work and uses an autoregressive framework so that it can scale to long sequences while maintaining visual quality and consistency.

Plain English Explanation

The paper describes a new way to automatically generate realistic-looking street view images that stay consistent over long stretches. Convincing synthetic street scenes could be useful in applications such as video games, virtual tours, or urban planning.

The key idea is to use a type of machine learning model called a "diffusion model", which learns to turn random noise into natural-looking images through a series of small denoising steps. The researchers use it "autoregressively": each new stretch of frames is generated with the previously generated frames as context, which helps keep the sequence coherent.
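To make the two ideas above concrete, here is a toy sketch in plain Python. It is not the paper's model: the "denoiser" is a hand-written stand-in that simply pulls noise toward a target, and the conditioning rule (next frame equals previous frame plus a small drift) is an assumption made purely for illustration. All function names are hypothetical.

```python
import random

def denoise_step(x, target, strength=0.3):
    """Move each value a fraction of the way toward the target.
    Stand-in for one step of a learned denoiser (assumption, not the paper's network)."""
    return [xi + strength * (ti - xi) for xi, ti in zip(x, target)]

def generate_frame(prev_frame, rng, steps=30, drift=0.1):
    """Start from Gaussian noise and iteratively denoise, conditioned on the previous frame."""
    # Hypothetical conditioning: the next frame should resemble the previous one, shifted slightly.
    target = [v + drift for v in prev_frame]
    x = [rng.gauss(0.0, 1.0) for _ in prev_frame]
    for _ in range(steps):
        x = denoise_step(x, target)
    return x

def generate_sequence(first_frame, n_frames, seed=0):
    """Autoregressive rollout: each new frame is generated from the one before it."""
    rng = random.Random(seed)
    frames = [list(first_frame)]
    while len(frames) < n_frames:
        frames.append(generate_frame(frames[-1], rng))
    return frames

if __name__ == "__main__":
    for frame in generate_sequence([0.0, 0.0, 0.0, 0.0], 5):
        print([round(v, 3) for v in frame])
```

Each "frame" here is just a short list of numbers. The point is the control flow: noise goes in, repeated denoising produces a frame, and that frame becomes the context for the next one.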

This allows the model to generate lengthy, consistent street view sequences, where details such as buildings and roads fit together realistically as you "move" through the scene. Earlier video generation and 3D view synthesis methods struggled to stay consistent over long distances; this approach is reported to scale to camera trajectories spanning several city blocks.

Technical Explanation

The paper introduces a novel street view generation framework that leverages autoregressive video diffusion to produce large-scale, consistent street view sequences.

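The model builds on recent video diffusion work and runs it within an autoregressive framework that can scale to long sequences. Generation is conditioned on language input (e.g., city name, weather) and on a map/layout containing the desired camera trajectory. To keep the autoregressive rollout from drifting away from the distribution of realistic city imagery, the authors introduce a temporal imputation method. Together, these let the generated views stay coherent over long sequences, addressing a key limitation of prior street view synthesis approaches.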
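One common way to run a fixed-length video model autoregressively is to generate overlapping windows of frames, reusing the tail of each window as known context for the next. The sketch below shows only that bookkeeping. The window size, the overlap, and the `toy_window` generator are hypothetical, and the paper's own mechanism for keeping the rollout from drifting is not modeled here.

```python
def window_schedule(total_frames, window, overlap):
    """Return (start, end) frame index pairs, end exclusive, that cover total_frames.
    Consecutive windows share `overlap` frames."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    spans = []
    start = 0
    while total_frames > 0:
        end = min(start + window, total_frames)
        spans.append((start, end))
        if end >= total_frames:
            break
        start = end - overlap
    return spans

def rollout(total_frames, window, overlap, generate_window):
    """Generate a long sequence window by window.
    generate_window(known, n_new) must return n_new frames that continue `known`."""
    frames = []
    for start, end in window_schedule(total_frames, window, overlap):
        known = frames[start:]               # already-generated frames inside this window
        n_new = (end - start) - len(known)   # frames still missing from this window
        frames.extend(generate_window(known, n_new))
    return frames

def toy_window(known, n_new):
    """Hypothetical stand-in for a video model: frames are integers that count upward."""
    last = known[-1] if known else -1
    return [last + 1 + i for i in range(n_new)]

if __name__ == "__main__":
    print(window_schedule(10, 4, 1))   # [(0, 4), (3, 7), (6, 10)]
    print(rollout(10, 4, 1, toy_window))
```

A larger overlap gives each window more context from the previous one but produces fewer new frames per window, so the trade-off is between consistency and the number of model invocations.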

According to the abstract, the system is trained on posed imagery from Google Street View together with contextual map data. The authors report that, compared with recent models for video generation or 3D view synthesis, the method scales to much longer camera trajectories while maintaining visual quality and consistency.

Additionally, the paper explores the use of self-attention to further enhance the model's capacity for long-range reasoning and scene understanding, which is critical for generating coherent street views.
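For readers unfamiliar with the term, the following is a minimal sketch of generic single-head self-attention in plain Python. It does not reproduce how the paper uses attention, which this summary does not specify. As a simplifying assumption, queries, keys, and values are the input tokens themselves, whereas real layers apply learned projections first.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections
    (queries = keys = values = tokens; an illustrative simplification)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token in the sequence, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of all tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

if __name__ == "__main__":
    for row in self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]):
        print([round(v, 3) for v in row])
```

Because every token attends to every other token, information from distant positions can influence each output directly, which is why attention is often used for long-range reasoning.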

Critical Analysis

The paper presents a compelling approach to the challenging problem of large-scale, consistent street view generation. The use of autoregressive video diffusion is a novel and promising direction that builds on recent advancements in generative modeling.

One potential limitation mentioned in the paper is the computational complexity of the model, which may make it difficult to deploy in real-time applications. The authors note that further optimizations and engineering efforts could help address this issue.

Additionally, the paper does not provide a comprehensive analysis of the model's ability to generalize to diverse street scenes, such as those in different geographic regions or with varying architectural styles. Evaluating the model's robustness and adaptability to a broader range of street view data could be an important area for future research.

It would also be valuable to explore the model's potential for interactive applications, where users could modify or manipulate the generated street views in real time. Integrating the model with other technologies, such as 3D synthesis or bird's-eye-view to street-view translation, could further enhance its practical utility.

Conclusion

This paper presents a significant advance in the field of street view generation by introducing a novel autoregressive video diffusion model that can produce large-scale, consistent street view sequences. The model's ability to maintain visual continuity and coherence over long distances is a notable improvement over previous techniques, and the authors' exploration of self-attention mechanisms shows promise for further enhancing scene understanding and reasoning.

While the computational complexity of the model may present some challenges, the overall approach represents an important step forward in the quest to create realistic, interactive virtual environments that can be used for a wide range of applications, from urban planning to immersive gaming experiences.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein

We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned by language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data (posed imagery from Google Street View, along with contextual map data), which allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page at https://boyangdeng.com/streetscapes.

7/26/2024

DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

Jianbiao Mei, Yukai Ma, Xuemeng Yang, Licheng Wen, Tiantian Wei, Min Dou, Botian Shi, Yong Liu

Recent advances in diffusion models have significantly enhanced the controllable generation of streetscapes and facilitated downstream perception and planning tasks. However, challenges such as maintaining temporal coherence, generating long videos, and accurately modeling driving scenes persist. Accordingly, we propose DreamForge, an advanced diffusion-based autoregressive video generation model designed for the long-term generation of 3D-controllable and extensible video. In terms of controllability, our DreamForge supports flexible conditions such as text descriptions, camera poses, 3D bounding boxes, and road layouts, while also providing perspective guidance to produce driving scenes that are both geometrically and contextually accurate. For consistency, we ensure inter-view consistency through cross-view attention and temporal coherence via an autoregressive architecture enhanced with motion cues. Codes will be available at https://github.com/PJLab-ADG/DriveArena.

9/9/2024

From Bird's-Eye to Street View: Crafting Diverse and Condition-Aligned Images with Latent Diffusion Model

Xiaojie Xu, Tianshuo Xu, Fulong Ma, Yingcong Chen

We explore Bird's-Eye View (BEV) generation, converting a BEV map into its corresponding multi-view street images. Valued for its unified spatial representation aiding multi-sensor fusion, BEV is pivotal for various autonomous driving applications. Creating accurate street-view images from BEV maps is essential for portraying complex traffic scenarios and enhancing driving algorithms. Concurrently, diffusion-based conditional image generation models have demonstrated remarkable outcomes, adept at producing diverse, high-quality, and condition-aligned results. Nonetheless, the training of these models demands substantial data and computational resources. Hence, exploring methods to fine-tune these advanced models, like Stable Diffusion, for specific conditional generation tasks emerges as a promising avenue. In this paper, we introduce a practical framework for generating images from a BEV layout. Our approach comprises two main components: the Neural View Transformation and the Street Image Generation. The Neural View Transformation phase converts the BEV map into aligned multi-view semantic segmentation maps by learning the shape correspondence between the BEV and perspective views. Subsequently, the Street Image Generation phase utilizes these segmentations as a condition to guide a fine-tuned latent diffusion model. This finetuning process ensures both view and style consistency. Our model leverages the generative capacity of large pretrained diffusion models within traffic contexts, effectively yielding diverse and condition-coherent street view images.

9/4/2024

SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm

Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Jinhua Yu, Haote Yang, Conghui He

Street-to-satellite image synthesis focuses on generating realistic satellite images from corresponding ground street-view images while maintaining a consistent content layout, similar to looking down from the sky. The significant differences in perspectives create a substantial domain gap between the views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing satellite images from street-view images, leveraging diffusion models and Bird's Eye View (BEV) paradigm. First, we design a Curved-BEV method to transform street-view images to the satellite view, reformulating the challenging cross-domain image synthesis task into a conditional generation problem. Curved-BEV also includes a Multi-to-One mapping strategy for leveraging multiple street-view images within the same satellite coverage area, effectively solving the occlusion issues in dense urban scenes. Next, we design a BEV-controlled diffusion model to generate satellite images consistent with the street-view content, which also incorporates a light manipulation module to make the lighting conditions of the synthesized satellite images more flexible. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on both suburban (CVUSA & CVACT) and urban (VIGOR-Chicago) cross-view datasets, with an average SSIM increase of 13.96% and a FID reduction of 20.54%, achieving realistic and content-consistent satellite image generation. The code and models of this work will be released at https://opendatalab.github.io/skydiffusion

8/20/2024