PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models

Read original: arXiv:2407.06109 - Published 7/17/2024 by Jinhua Zhang, Hualian Sheng, Sijia Cai, Bing Deng, Qiao Liang, Wen Li, Ying Fu, Jieping Ye, Shuhang Gu

Overview

This paper introduces PerlDiff (Perspective-Layout Diffusion Models), a method for controllable street view image generation.
PerlDiff lets users set the camera perspective and scene layout that the generated street view image should follow.
The method builds on diffusion models, a class of generative models that produce images by iteratively denoising random noise.
According to the abstract, PerlDiff uses perspective 3D geometric information to give precise object-level control, with data production for autonomous driving as the motivating use case.

Plain English Explanation

PerlDiff is a new way to generate realistic images of street scenes. Its main selling point is control: users specify where objects should appear, and the authors report that PerlDiff follows those instructions more precisely than alternative layout control methods.

The key idea is to combine diffusion models, a type of generative AI that can create high-quality images from scratch or by modifying existing ones, with perspective and layout control. PerlDiff feeds the model information about the camera angle (perspective) and the arrangement of objects in the scene (layout), so that users can customize the street view.

For example, a user could generate a street scene from a specific camera angle, with chosen positions for buildings, cars, and other objects. The paper's stated motivation is data production for autonomous driving, where annotating 3D data is a challenge. The same level of control could also be useful for urban planning, virtual tourism, and video game development.

The SGDIFF and MagicDrive3D papers have also explored similar ideas for generating and editing street views, but PerlDiff offers some unique advantages in terms of the level of control and the quality of the generated images.

Technical Explanation

PerlDiff uses diffusion models to enable controllable street view synthesis. The key innovation is the integration of perspective and layout information into the diffusion model, which allows users to manipulate the camera viewpoint and scene composition. The abstract describes this as employing 3D geometric priors to guide generation with precise object-level control within the network learning process.
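
For readers unfamiliar with diffusion models, the forward (noising) half of a standard DDPM-style diffusion process can be sketched in a few lines. This is generic background rather than code from PerlDiff. The function names are hypothetical, and the linear schedule from 1e-4 to 0.02 over 1000 steps is an assumed, commonly used default.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Noised sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.5, -0.25, 1.0]                      # toy "image" of three pixel values
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise, one value per pixel
x_early = q_sample(x0, 10, alpha_bars, noise)    # still close to x0
x_late = q_sample(x0, 999, alpha_bars, noise)    # almost pure noise
```

A denoising network is trained to predict the added noise from the noised sample and the step index. In a conditional model it also receives extra inputs, such as a layout, and image generation runs the process in reverse starting from pure noise.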

The architecture of PerlDiff consists of two main components: a perspective encoder and a layout encoder. The perspective encoder takes in information about the camera angle, such as the pitch, yaw, and roll, and encodes this into a latent representation. The layout encoder takes in a semantic layout of the scene, which describes the arrangement and properties of the various objects (e.g., buildings, roads, vehicles).
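
To make the link between a 3D layout and a camera view concrete, here is a minimal sketch of how a 3D box could be projected into the image plane with a pinhole camera and turned into a binary mask. It is an illustration under stated assumptions, not PerlDiff's implementation: the function names, the camera-frame convention (x right, y down, z forward), and the intrinsics used in the example are all hypothetical.

```python
import math
from itertools import product

def box_corners(center, size, yaw):
    """Eight corners of a 3D box in camera coordinates (x right, y down, z forward).

    yaw rotates the box about the vertical (y) axis; angles are in radians.
    """
    cx, cy, cz = center
    width, height, length = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        x, y, z = sx * width, sy * height, sz * length
        corners.append((cx + c * x + s * z, cy + y, cz - s * x + c * z))
    return corners

def project_point(point, fx, fy, u0, v0):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + u0, fy * y / z + v0)

def box_to_image_rect(corners, fx, fy, u0, v0):
    """Axis-aligned pixel rectangle enclosing all projected corners."""
    pixels = [project_point(p, fx, fy, u0, v0) for p in corners]
    us = [u for u, _ in pixels]
    vs = [v for _, v in pixels]
    return (min(us), min(vs), max(us), max(vs))

def rect_to_mask(rect, width, height):
    """Binary mask (rows of 0/1) marking pixels whose centers lie inside rect."""
    u_min, v_min, u_max, v_max = rect
    return [
        [1 if u_min <= col + 0.5 <= u_max and v_min <= row + 0.5 <= v_max else 0
         for col in range(width)]
        for row in range(height)
    ]

# A 2 m cube centered 10 m in front of a hypothetical 100x100 pixel camera.
corners = box_corners(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0), yaw=0.0)
rect = box_to_image_rect(corners, fx=100.0, fy=100.0, u0=50.0, v0=50.0)
mask = rect_to_mask(rect, width=100, height=100)
```

Changing the camera parameters or the box pose changes the projected rectangle, which is the sense in which perspective and layout jointly determine where an object should appear in the image.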

These two latent representations are then combined and used to condition a diffusion model, which can generate the final street view image. By allowing users to adjust the perspective and layout inputs, PerlDiff provides fine-grained control over the generated street scenes.
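
One common way to let a layout steer a diffusion model is to mask the cross-attention between image locations and condition tokens, so that an object's token only influences the pixels inside its projected region. The sketch below illustrates that general idea in plain Python. It is an assumption about how such conditioning can work, not a description of PerlDiff's actual module, and every name in it is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def masked_cross_attention(queries, keys, values, mask, blocked=-1e9):
    """Scaled dot-product cross-attention with a per-location token mask.

    queries: one feature vector per image location.
    keys, values: one vector per condition token (e.g. scene, car).
    mask[i][j] is 1 if location i may attend to token j, else 0.
    """
    dim = len(keys[0])
    outputs = []
    for query, allowed_row in zip(queries, mask):
        scores = []
        for key, allowed in zip(keys, allowed_row):
            score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            # Blocked tokens get a large negative score, so their weight is ~0.
            scores.append(score if allowed else score + blocked)
        weights = softmax(scores)
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Two image locations and two condition tokens: token 0 = scene, token 1 = car.
queries = [[1.0, 0.0], [1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 1.0], [1.0, 0.0]]
mask = [[1, 0],   # location 0 lies outside the car's projected box
        [1, 1]]   # location 1 lies inside it
attended = masked_cross_attention(queries, keys, values, mask)
```

In this toy setup, location 0 receives only the scene token's value, while location 1 receives an equal blend of the scene and car values because both tokens score the same.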

The authors evaluate PerlDiff on the NuScenes and KITTI datasets and report that it markedly improves the precision of generation, with better controllability than alternative layout control methods. They also conduct user studies to assess the perceived realism and controllability of the generated images.

Critical Analysis

One potential limitation of PerlDiff is that it relies on a semantic layout as an input, which may not always be available or easy to obtain. The authors mention that future work could explore methods for automatically extracting the layout information from other sources, such as satellite imagery or street-level photographs.

Additionally, although PerlDiff offers fine-grained control over the generated street views, constraints or artifacts may still limit its flexibility. The authors acknowledge that further research is needed to explore the full range of possible manipulations and to address remaining quality and consistency issues.

Another area for future work could be to investigate the integration of PerlDiff with other 3D generation or image editing techniques, which could expand the capabilities and applications of the system.

Conclusion

PerlDiff advances street view synthesis by giving users object-level control over the generation process. By integrating perspective and layout information into a diffusion model, the authors have built a tool that can generate realistic and customizable street scenes.

This technology has the potential to impact a wide range of applications, from urban planning and virtual tourism to video game development and beyond. As the authors continue to refine and expand the capabilities of PerlDiff, it will be exciting to see how this approach can be leveraged to transform the way we interact with and visualize our built environments.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

PerlDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models

Jinhua Zhang, Hualian Sheng, Sijia Cai, Bing Deng, Qiao Liang, Wen Li, Ying Fu, Jieping Ye, Shuhang Gu

Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the integration of controlling information and introduce PerlDiff (Perspective-Layout Diffusion Models), a method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerlDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results justify that our PerlDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets. Our codes and models are publicly available at https://github.com/LabShuHangGU/PerlDiff.

7/17/2024

CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis

Weijia Li, Jun He, Junyan Ye, Huaping Zhong, Zhimeng Zheng, Zilong Huang, Dahua Lin, Conghui He

Satellite-to-street view synthesis aims at generating a realistic street-view image from its corresponding satellite-view image. Although stable diffusion models have exhibited remarkable performance in a variety of image generation applications, their reliance on similar-view inputs to control the generated structure or texture restricts their application to the challenging cross-view synthesis task. In this work, we propose CrossViewDiff, a cross-view diffusion model for satellite-to-street view synthesis. To address the challenges posed by the large discrepancy across views, we design the satellite scene structure estimation and cross-view texture mapping modules to construct the structural and textural controls for street-view image synthesis. We further design a cross-view control guided denoising process that incorporates the above controls via an enhanced cross-view attention module. To achieve a more comprehensive evaluation of the synthesis results, we additionally design a GPT-based scoring method as a supplement to standard evaluation metrics. We also explore the effect of different data sources (e.g., text, maps, building heights, and multi-temporal satellite imagery) on this task. Results on three public cross-view datasets show that CrossViewDiff outperforms current state-of-the-art on both standard and GPT-based evaluation metrics, generating high-quality street-view panoramas with more realistic structures and textures across rural, suburban, and urban scenes. The code and models of this work will be released at https://opendatalab.github.io/CrossViewDiff/.

8/28/2024

SkyDiffusion: Street-to-Satellite Image Synthesis with Diffusion Models and BEV Paradigm

Junyan Ye, Jun He, Weijia Li, Zhutao Lv, Jinhua Yu, Haote Yang, Conghui He

Street-to-satellite image synthesis focuses on generating realistic satellite images from corresponding ground street-view images while maintaining a consistent content layout, similar to looking down from the sky. The significant differences in perspectives create a substantial domain gap between the views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing satellite images from street-view images, leveraging diffusion models and Bird's Eye View (BEV) paradigm. First, we design a Curved-BEV method to transform street-view images to the satellite view, reformulating the challenging cross-domain image synthesis task into a conditional generation problem. Curved-BEV also includes a Multi-to-One mapping strategy for leveraging multiple street-view images within the same satellite coverage area, effectively solving the occlusion issues in dense urban scenes. Next, we design a BEV-controlled diffusion model to generate satellite images consistent with the street-view content, which also incorporates a light manipulation module to make the lighting conditions of the synthesized satellite images more flexible. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on both suburban (CVUSA & CVACT) and urban (VIGOR-Chicago) cross-view datasets, with an average SSIM increase of 13.96% and a FID reduction of 20.54%, achieving realistic and content-consistent satellite image generation. The code and models of this work will be released at https://opendatalab.github.io/skydiffusion

8/20/2024

Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

Abdelrahman Eldesokey, Peter Wonka

We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms other methods in comparison in preserving objects under layout changes. Project Page: https://abdo-eldesokey.github.io/build-a-scene/

8/28/2024