Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Read original: arXiv:2406.06525 - Published 6/11/2024 by Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

Overview

This paper introduces LlamaGen, a family of autoregressive image generation models that apply the next-token prediction paradigm of large language models such as Llama to images.
LlamaGen adds no vision-specific inductive biases: an image tokenizer converts each image into discrete tokens, and a vanilla Llama-style transformer predicts those tokens one at a time.
The authors report class-conditional models from 111M to 3.1B parameters that reach 2.18 FID on ImageNet 256x256, outperforming popular diffusion models such as LDM and DiT, along with a 775M-parameter text-conditional model.

Plain English Explanation

The paper presents a family of machine learning models called LlamaGen that can generate high-quality images. Unlike diffusion models, which have been popular for image generation, LlamaGen uses autoregression, the same approach that large language models such as Llama use to generate text.

Autoregressive models work by predicting the next piece of a sequence from the pieces generated so far. For images, LlamaGen first uses a tokenizer to compress a picture into a grid of discrete tokens, each standing for a small patch, and then predicts those tokens one after another. The model itself has no image-specific design. The paper's claim is that a plain language-model architecture, scaled up properly, can match or beat popular diffusion models on image quality.

The researchers evaluate LlamaGen on class-conditional and text-conditional image generation, and they show that serving frameworks built for language models also speed up image generation. This could be useful for applications like image editing, content creation, and visual art generation.

Technical Explanation

The paper introduces LlamaGen, a family of autoregressive models for scalable image generation. An image tokenizer maps each image to a sequence of discrete tokens, and a Llama-style transformer predicts the next token from the tokens already generated. Diffusion models, by contrast, produce an image by iteratively denoising it.
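To make next-token prediction over image tokens concrete, here is a minimal sketch of the sampling loop. It is not the authors' implementation: `toy_logits` is a hypothetical stand-in for a trained transformer, the vocabulary and grid sizes are tiny illustrative values, and the raster-scan reshaping in `to_grid` is an assumption about token order.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution over token ids."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prefix, class_label, vocab_size):
    """Hypothetical stand-in for a trained transformer (not the paper's model).

    Returns one score per codebook entry, conditioned on the class label
    and on the tokens generated so far.
    """
    last = prefix[-1] if prefix else class_label
    return [math.cos(0.7 * v + 0.3 * last + 0.1 * len(prefix))
            for v in range(vocab_size)]


def sample_tokens(class_label, num_tokens, vocab_size, temperature=1.0, seed=0):
    """Generate image tokens one at a time, each conditioned on the prefix."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        probs = softmax(toy_logits(tokens, class_label, vocab_size), temperature)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


def to_grid(tokens, side):
    """Reshape the flat sequence into a side x side grid (assumed raster order)."""
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


tokens = sample_tokens(class_label=3, num_tokens=16, vocab_size=8)
grid = to_grid(tokens, side=4)
```

In a real system the token grid would then be passed to the tokenizer's decoder to reconstruct pixels. That step is omitted here.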

LlamaGen deliberately avoids inductive biases on visual signals. Instead of a hierarchical or multi-resolution design, the authors reexamine three factors: the design of the image tokenizer, the scalability of the generation model, and the quality of the training data. The resulting tokenizer has a downsample ratio of 16 and reaches 0.94 rFID with 97% codebook usage on ImageNet.
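The abstract reports an image tokenizer with a downsample ratio of 16 and 97% codebook usage. The short sketch below shows what those two numbers mean in practice. The function names are hypothetical, and the codebook size in the usage example is a small illustrative value, not the paper's.

```python
def token_grid(image_size, downsample_ratio):
    """Return (grid side, token count) for a square image.

    With a downsample ratio of 16, each token covers a 16x16 pixel patch.
    """
    if image_size % downsample_ratio != 0:
        raise ValueError("image_size must be divisible by downsample_ratio")
    side = image_size // downsample_ratio
    return side, side * side


def codebook_usage(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once in token_ids."""
    return len(set(token_ids)) / codebook_size


# A 256x256 image at ratio 16 becomes a 16x16 grid, i.e. 256 tokens.
side, num_tokens = token_grid(256, 16)

# Illustrative only: 3 of 4 hypothetical codebook entries are used.
usage = codebook_usage([0, 1, 1, 3], codebook_size=4)
```

A sequence of 256 tokens per 256x256 image is short enough for a language-model-style transformer to process comfortably. High codebook usage indicates that few codebook entries go unused.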

The authors evaluate LlamaGen on class-conditional and text-conditional generation. The class-conditional models range from 111M to 3.1B parameters and achieve 2.18 FID on ImageNet 256x256, outperforming popular diffusion models such as LDM and DiT. The 775M-parameter text-conditional model, trained in two stages on LAION-COCO and on images of high aesthetic quality, shows competitive visual quality and text alignment. Using LLM serving frameworks at inference time yields a 326% to 414% speedup.
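The paper's abstract reports image quality as FID (Fréchet Inception Distance), where lower is better. For reference, the standard definition compares Gaussian fits to Inception-network features of real and generated images. The symbols below are notation introduced here, not taken from the paper.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ are the mean and covariance of features from real images, and $\mu_g, \Sigma_g$ are those from generated images. The rFID figure quoted for the tokenizer is commonly read as the same metric computed on reconstructed rather than generated images.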

Critical Analysis

The paper provides a compelling argument for the use of autoregressive models like LlamaGen for scalable image generation. Its central result is that a vanilla architecture without vision-specific design can outperform popular diffusion models when scaled properly, and the authors back this with ImageNet results.

However, the paper doesn't fully address potential limitations of LlamaGen. For example, autoregressive models can suffer from "exposure bias": during training the model sees ground-truth tokens as context, but during generation it conditions on its own earlier predictions, so an early mistake can compound across the rest of the sequence and reduce image quality.

Additionally, the paper doesn't discuss the training process or hyperparameter tuning in depth. It would be helpful to understand the challenges the authors faced in optimizing LlamaGen and how they overcame them.

Further research could also explore ways to combine the strengths of autoregressive and diffusion models, as suggested by some recent work like KaleIDo. This could potentially yield even more powerful and flexible image generation capabilities.

Conclusion

The LlamaGen paper presents a family of autoregressive models that outperform popular diffusion models such as LDM and DiT on class-conditional ImageNet generation. By pairing a carefully designed image tokenizer with a vanilla Llama-style transformer and scaling it properly, LlamaGen generates high-quality images without vision-specific inductive biases.

While the paper doesn't address all potential limitations of the approach, it makes a compelling case for the use of autoregressive models in image generation. LlamaGen's strong results on class-conditional and text-conditional generation suggest that it could be a valuable tool for applications like image editing, content creation, and visual art generation.

As the field of generative AI continues to evolve, it will be interesting to see how researchers build upon the ideas presented in this paper to further advance the state of the art in image generation.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

We introduce LlamaGen, a new family of image generation models that apply original "next-token prediction" paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.

6/11/2024

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

Xiaochuang Han, Marjan Ghazvininejad, Pang Wei Koh, Yulia Tsvetkov

Recent work in image and video generation has been adopting the autoregressive LLM architecture due to its generality and potentially easy integration into multi-modal systems. The crux of applying autoregressive training in language generation to visual generation is discretization -- representing continuous data like images and videos as discrete tokens. Common methods of discretizing images and videos include modeling raw pixel values, which are prohibitively lengthy, or vector quantization, which requires convoluted pre-hoc training. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (e.g., JPEG, AVC/H.264). Using the default Llama architecture without any vision-specific modifications, we pretrain JPEG-LM from scratch to generate images (and AVC-LM to generate videos as a proof of concept), by directly outputting compressed file bytes in JPEG and AVC formats. Evaluation of image generation shows that this simple and straightforward approach is more effective than pixel-based modeling and sophisticated vector quantization baselines (on which our method yields a 31% reduction in FID). Our analysis shows that JPEG-LM has an especial advantage over vector quantization models in generating long-tail visual elements. Overall, we show that using canonical codec representations can help lower the barriers between language generation and visual generation, facilitating future research on multi-modal language/image/video LLMs.

8/22/2024

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine next-scale prediction or next-resolution prediction, diverging from the standard raster-scan next-token prediction. This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On ImageNet 256x256 benchmark, VAR significantly improve AR baseline by improving Frechet inception distance (FID) from 18.65 to 1.73, inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.

6/11/2024

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun

Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.

4/17/2024