Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Read original: arXiv:2311.17919 - Published 4/4/2024 by Daniel Geng, Inbum Park, Andrew Owens

Overview

The paper addresses the problem of synthesizing multi-view optical illusions, which are images that change appearance when transformed, such as by flipping or rotating.
The authors propose a simple, zero-shot method for generating these illusions using off-the-shelf text-to-image diffusion models.
At each denoising step, the method estimates the noise from several transformed views of the noisy image, combines these estimates, and uses the result to denoise, yielding an image that changes appearance under specific transformations.
Supported transformations include rotations and flips as well as more exotic pixel permutations, such as a jigsaw rearrangement.
The approach can produce illusions with more than two views.

Plain English Explanation

The paper describes a way to create optical illusions where an image changes its appearance when you do something to it, like flipping or rotating it. The authors came up with a simple method that uses existing AI models that can generate images from text descriptions.

The key idea builds on how diffusion models work: they start from an image of pure noise and remove that noise step by step. At each step, the method asks the model what noise it sees in each view of the image (for example, upright and flipped), then combines those answers before removing the noise. The final image therefore has to make sense in every view at once, so it might look like one thing when upright and like something else when flipped or rotated.

This isn't limited to just flips and rotations - the approach can also handle more complex pixel rearrangements, like shuffling the pieces of the image around like a jigsaw puzzle. And it can even create illusions with more than two different views.

The authors back the technique with a theoretical analysis of which transformations it supports, and they provide examples demonstrating the effectiveness and versatility of their method.

Technical Explanation

The core of the authors' approach is to leverage the noise estimation and denoising process in text-to-image diffusion models. During the reverse diffusion process, where a noisy image is gradually cleaned up, the method estimates the noise from different views (e.g. rotations, flips) of the noisy image at each step.

They then combine these noise estimates into a single estimate and use it to denoise the image. Because every view contributes to each denoising step, the result is an image that changes its appearance under the corresponding transformations, creating the desired optical illusion.
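
The combination step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that each estimate is mapped back through the inverse of its view and that the estimates are then averaged, and `toy_noise_estimate` is a hypothetical stand-in for a real diffusion model's noise predictor. The prompts and function names are illustrative only.

```python
import numpy as np

def identity(img):
    """Trivial view: leave the image unchanged."""
    return img

def rot90_view(img):
    """View: rotate the image 90 degrees counter-clockwise."""
    return np.rot90(img, k=1, axes=(0, 1))

def rot90_inverse(img):
    """Inverse view: rotate the image 90 degrees clockwise."""
    return np.rot90(img, k=-1, axes=(0, 1))

def toy_noise_estimate(noisy_img, prompt, t):
    """Hypothetical placeholder for a diffusion model's noise predictor.

    A real model would be a neural network conditioned on the prompt and
    the timestep t. Here we only scale the input by a prompt-dependent
    factor so the example runs without any model weights.
    """
    scale = (len(prompt) % 5 + 1) / 10.0
    return scale * noisy_img

def combined_noise_estimate(noisy_img, views, prompts, t,
                            predict=toy_noise_estimate):
    """Estimate noise in each view, undo the view, and average (assumed rule)."""
    estimates = []
    for (view, inverse), prompt in zip(views, prompts):
        eps = predict(view(noisy_img), prompt, t)  # noise as seen in this view
        estimates.append(inverse(eps))             # map back to the base frame
    return np.mean(estimates, axis=0)

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8, 3))  # toy "noisy image"
views = [(identity, identity), (rot90_view, rot90_inverse)]
prompts = ["an oil painting of a rabbit", "an oil painting of a duck"]
eps_hat = combined_noise_estimate(x_t, views, prompts, t=500)
print(eps_hat.shape)
```

In a real sampler, `eps_hat` would replace the single-view noise estimate inside each reverse diffusion step; the rest of the sampling loop would stay unchanged.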

The authors' theoretical analysis suggests that the method works precisely for views that can be written as orthogonal transformations. Pixel permutations are a subset of these, which covers common operations like rotations and flips as well as more exotic rearrangements.
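
A short derivation suggests why orthogonality is the natural condition. The notation here is an assumption rather than something taken from the summary: a_t and b_t stand for the signal and noise weights of a standard forward diffusion process, and A is the matrix of a linear view acting on the flattened image.

```latex
\begin{aligned}
x_t &= a_t\,x_0 + b_t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
A x_t &= a_t\,(A x_0) + b_t\,(A\epsilon), \qquad A\epsilon \sim \mathcal{N}(0, A A^{\top}) \\
A A^{\top} &= I \;\Longrightarrow\; A\epsilon \sim \mathcal{N}(0, I)
\end{aligned}
```

When A is orthogonal, the transformed input is still a clean image plus standard Gaussian noise at the same noise level, which is the kind of input the pretrained model expects. A view that rescales or correlates the noise would break that assumption.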

This insight leads to the concept of a "visual anagram" - an image that changes its appearance when the pixels are rearranged in a specific way, similar to how rearranging the letters in a word can create a new word.
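
To make the idea of a pixel rearrangement concrete, the sketch below builds a simplified jigsaw-style view that shuffles square patches, applies it, inverts it, and checks that the corresponding matrix is orthogonal. Square patches are an assumption made for brevity and are not the paper's jigsaw pieces; all function names are hypothetical.

```python
import numpy as np

def make_patch_permutation(height, width, patch, rng):
    """Return a flat pixel-index permutation that shuffles square patches."""
    rows, cols = height // patch, width // patch
    pixel_ids = np.arange(height * width).reshape(height, width)
    # Cut the grid of pixel indices into (rows * cols) patches.
    patches = (pixel_ids.reshape(rows, patch, cols, patch)
               .transpose(0, 2, 1, 3)
               .reshape(rows * cols, patch, patch))
    shuffled = patches[rng.permutation(rows * cols)]
    # Stitch the shuffled patches back into a full index grid.
    grid = (shuffled.reshape(rows, cols, patch, patch)
            .transpose(0, 2, 1, 3)
            .reshape(height, width))
    return grid.reshape(-1)

def apply_view(img, perm):
    """Rearrange pixels: output pixel i takes its value from input pixel perm[i]."""
    h, w = img.shape[:2]
    flat = img.reshape(h * w, -1)
    return flat[perm].reshape(img.shape)

def invert_view(img, perm):
    """Undo apply_view by applying the inverse permutation."""
    return apply_view(img, np.argsort(perm))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
perm = make_patch_permutation(8, 8, patch=4, rng=rng)

scrambled = apply_view(img, perm)
restored = invert_view(scrambled, perm)

# The same view written as a matrix acting on flattened pixels.
P = np.eye(perm.size)[perm]
print(np.array_equal(restored, img), np.allclose(P @ P.T, np.eye(perm.size)))
```

Because a permutation matrix has exactly one 1 in each row and column, its transpose is its inverse, which is why any pixel rearrangement of this kind falls within the orthogonal transformations described above.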

The authors demonstrate their method producing a variety of multi-view optical illusions, both qualitatively and through quantitative evaluations. They show it can handle more than two views, and provide additional results and visualizations on their project webpage.

Critical Analysis

The paper presents a clever and flexible approach for synthesizing multi-view optical illusions using text-to-image diffusion models. The key theoretical insight - that the method works for any orthogonal transformation - is a nice result that unlocks a wide range of potential visual effects beyond just flips and rotations.

That said, the paper does not dive deeply into potential limitations or caveats of the approach. For example, it's unclear how the method would scale to higher-resolution images, or how the generated illusions would hold up under close scrutiny.

Additionally, the authors don't explore the perceptual qualities of the resulting illusions - how compelling or convincing they are to the human eye, and whether there are ways to further optimize the illusion effects.

Further research could also investigate potential applications of this technique beyond just visual novelty, such as in art, design, or even security applications (e.g. creating forgery-resistant images).

Overall, this is a technically solid piece of research that introduces an intriguing new direction for optical illusion synthesis. With further refinement and exploration of the capabilities and limitations, it could lead to interesting real-world uses.

Conclusion

This paper presents a simple yet powerful method for synthesizing multi-view optical illusions using off-the-shelf text-to-image diffusion models. By leveraging the noise estimation and denoising process in these models, the authors are able to create images that change their appearance under specific transformations, including not just flips and rotations, but also more exotic pixel rearrangements.

The key theoretical insight - that the method works for any orthogonal transformation - unlocks a wide range of potential visual effects, including "visual anagrams". The authors demonstrate the effectiveness and flexibility of their approach through qualitative and quantitative results.

While the paper doesn't delve deeply into limitations or potential real-world applications, it introduces an intriguing new direction for optical illusion synthesis that could lead to interesting developments in areas like art, design, and even security. With further refinement and exploration, this technique could enable the creation of increasingly compelling and versatile visual illusions.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Daniel Geng, Inbum Park, Andrew Owens

We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image, and then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram, an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/

4/4/2024

Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, Xin Tong

We propose a novel image editing technique that enables 3D manipulations on single images, such as object rotation and translation. Existing 3D-aware image editing approaches typically rely on synthetic multi-view datasets for training specialized models, thus constraining their effectiveness on open-domain images featuring significantly more varied layouts and styles. In contrast, our method directly leverages powerful image diffusion models trained on a broad spectrum of text-image pairs and thus retain their exceptional generalization abilities. This objective is realized through the development of an iterative novel view synthesis and geometry alignment algorithm. The algorithm harnesses diffusion models for dual purposes: they provide appearance prior by predicting novel views of the selected object using estimated depth maps, and they act as a geometry critic by correcting misalignments in 3D shapes across the sampled views. Our method can generate high-quality 3D-aware image edits with large viewpoint transformations and high appearance and shape consistency with the input image, pushing the boundaries of what is possible with single-image 3D-aware editing.

7/16/2024

TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models

Gilad Deutch, Rinon Gal, Daniel Garibi, Or Patashnik, Daniel Cohen-Or

Diffusion models have opened the path to a wide range of text-based image editing frameworks. However, these typically build on the multi-step nature of the diffusion backwards process, and adapting them to distilled, fast-sampling methods has proven surprisingly challenging. Here, we focus on a popular line of text-based editing frameworks - the "edit-friendly" DDPM-noise inversion approach. We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength. We trace the artifacts to mismatched noise statistics between inverted noises and the expected noise schedule, and suggest a shifted noise schedule which corrects for this offset. To increase editing strength, we propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts. All in all, our method enables text-based image editing with as few as three diffusion steps, while providing novel insights into the mechanisms behind popular text-based editing approaches.

8/2/2024

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

George Cazenavette, Avneesh Sud, Thomas Leung, Ben Usman

Due to the high potential for abuse of GenAI systems, the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately, existing image-space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work, we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g., DALL-E 3) even when the detector is trained only on lower fidelity fake images generated via Stable Diffusion. This detector achieves new state-of-the-art across multiple training and evaluation setups. Moreover, we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors' in-the-wild performance, and release these datasets as public benchmarks for future research.

6/14/2024