Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

2310.05737

Published 4/1/2024 by Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu and 6 others

cs.CV cs.AI cs.MM

Abstract

While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation. To effectively use LLMs for visual generation, one crucial component is the visual tokenizer that maps pixel-space inputs to discrete tokens appropriate for LLM learning. In this paper, we introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary. Equipped with this new tokenizer, we show that LLMs outperform diffusion models on standard image and video generation benchmarks including ImageNet and Kinetics. In addition, we demonstrate that our tokenizer surpasses the previously top-performing video tokenizer on two more tasks: (1) video compression comparable to the next-generation video codec (VVC) according to human evaluations, and (2) learning effective representations for action recognition tasks.

Overview

Large Language Models (LLMs) are powerful for generative language tasks but struggle with image and video generation compared to diffusion models.
Effective use of LLMs for visual generation requires a visual tokenizer that can map pixel-space inputs to discrete tokens suitable for LLM learning.
The paper introduces MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images using a common token vocabulary.
With MAGVIT-v2, LLMs outperform diffusion models on standard image and video generation benchmarks like ImageNet and Kinetics.
The tokenizer also surpasses previous top-performing video tokenizers in video compression and action recognition tasks.

Plain English Explanation

Imagine you want to teach a language model to understand and generate images or videos. Just like how language models learn by breaking down text into individual words or tokens, visual data needs to be broken down into smaller, meaningful units that the model can comprehend. However, existing methods for converting visual data into tokens struggle to capture the nuances and complexities of images and videos effectively.

The researchers behind this paper have developed a new visual tokenizer called MAGVIT-v2. Think of it as a specialized translator that can take raw pixel data from images and videos and convert it into a sequence of tokens that a language model can easily understand. What makes MAGVIT-v2 unique is its ability to generate concise yet expressive tokens that accurately represent the visual information, whether it's a still image or a moving video.

By using this improved tokenizer, the researchers found that language models could outperform specialized diffusion models (which are typically better at generating images and videos) on standard benchmarks for image and video generation tasks. It's like giving the language model a pair of glasses that allows it to see visual data more clearly, enabling it to generate higher-quality images and videos.

But that's not all – MAGVIT-v2 also demonstrated remarkable performance in other visual tasks, such as video compression and action recognition. For video compression, the tokenizer produced compressed videos that were comparable in quality to next-generation video codecs, as judged by human evaluators. In action recognition tasks, the tokenizer helped language models learn effective representations of actions in videos, improving their ability to recognize and classify different activities.

Technical Explanation

The paper introduces MAGVIT-v2, a visual tokenizer designed to generate concise and expressive tokens for both images and videos using a common token vocabulary. The tokenizer builds on MAGVIT (Masked Generative Video Transformer), which was introduced in earlier work.

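MAGVIT-v2 changes two things relative to its predecessor. First, it replaces the learned vector-quantization codebook with lookup-free quantization (LFQ): each dimension of the encoder's latent vector is quantized independently to one of two values, and the resulting bit pattern is read directly as the token index, so the vocabulary can grow very large without a table of codebook embeddings to search. Second, the encoder is a causal 3D convolutional network in which each frame depends only on earlier frames, so the first frame is encoded on its own and a single image can be tokenized with the same vocabulary as a video.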
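To make the idea of mapping pixels to discrete tokens more concrete, here is a minimal sketch of sign-based, lookup-free quantization, the style of quantizer the MAGVIT-v2 paper describes. It is not the authors' code: the function names, the 4-dimensional latent size, and the random stand-in latents are all assumptions chosen for illustration, and the encoder that would produce the latents is omitted.

```python
import random


def lfq_token(latent):
    """Map one latent vector to an integer token using the sign of each dimension."""
    # Bit i is set when latent[i] > 0, so K dimensions give 2**K possible tokens.
    return sum(1 << i for i, value in enumerate(latent) if value > 0)


def lfq_decode(token, num_dims):
    """Recover the quantized +1/-1 vector that a token index stands for."""
    return [1.0 if (token >> i) & 1 else -1.0 for i in range(num_dims)]


def tokenize(latents):
    """Tokenize a sequence of latent vectors, one token per position."""
    return [lfq_token(vec) for vec in latents]


if __name__ == "__main__":
    rng = random.Random(0)
    num_dims = 4  # assumed toy size: 2**4 = 16 possible tokens
    # Random stand-ins for encoder outputs (hypothetical; no real encoder here).
    latents = [[rng.uniform(-1, 1) for _ in range(num_dims)] for _ in range(6)]
    tokens = tokenize(latents)
    print("tokens:", tokens)
    print("first token decoded:", lfq_decode(tokens[0], num_dims))
```

Because the token index is just the bit pattern of signs, no nearest-neighbour search over a codebook is needed, and adding one latent dimension doubles the vocabulary size.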

The researchers conducted experiments on standard image and video generation benchmarks, including ImageNet and Kinetics. They trained large language models (LLMs) using the tokens generated by MAGVIT-v2 and compared their performance against diffusion models, which are typically considered state-of-the-art for visual generation tasks.

Surprisingly, the LLMs equipped with MAGVIT-v2 tokenization outperformed diffusion models on both image and video generation tasks. This suggests that LLMs, when provided with effective visual tokenization, can excel at generating high-quality visual data, challenging the dominance of diffusion models in this domain.

In addition to generation tasks, the researchers evaluated MAGVIT-v2 on video compression and action recognition tasks. For video compression, they found that MAGVIT-v2 tokenization achieved quality comparable to the next-generation video codec VVC (Versatile Video Coding), as judged by human evaluators. In action recognition tasks, MAGVIT-v2 tokens helped models learn effective representations for recognizing and classifying actions in videos.
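One way to see why a tokenizer can be treated as a codec: a token drawn from a vocabulary of size V costs at most log2(V) bits, so a token stream implies an upper bound on bitrate even before any entropy coding. The sketch below works through that arithmetic. The resolution, frame rate, token rate, and vocabulary size are hypothetical values picked for illustration and are not figures reported in the paper.

```python
import math


def bits_per_token(vocab_size):
    """Upper bound on the bits needed to store one token (no entropy coding)."""
    return math.log2(vocab_size)


def token_bitrate(tokens_per_second, vocab_size):
    """Upper bound on the bitrate of a token stream, in bits per second."""
    return tokens_per_second * bits_per_token(vocab_size)


def raw_bitrate(width, height, fps, bits_per_pixel=24):
    """Bitrate of uncompressed video, in bits per second."""
    return width * height * fps * bits_per_pixel


def compression_ratio(width, height, fps, tokens_per_second, vocab_size):
    """Ratio of raw video bitrate to the token stream's bitrate bound."""
    return raw_bitrate(width, height, fps) / token_bitrate(tokens_per_second, vocab_size)


if __name__ == "__main__":
    # All values below are assumptions for illustration only.
    vocab_size = 2 ** 18
    tokens_per_second = 1024
    print("bits per token:", bits_per_token(vocab_size))
    print("token bitrate (bps):", token_bitrate(tokens_per_second, vocab_size))
    print("compression ratio:", compression_ratio(128, 128, 8, tokens_per_second, vocab_size))
```

A real codec comparison would also account for entropy coding of the tokens and for perceptual quality, which is why the paper relies on human evaluations rather than bitrate alone.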

Critical Analysis

While the results presented in the paper are promising, there are several limitations and areas for further research that warrant consideration.

Firstly, the experiments were conducted on standard benchmarks, which may not fully capture the complexities and diverse visual domains encountered in real-world applications. It would be valuable to evaluate the performance of MAGVIT-v2 and the LLMs trained on its tokens across a broader range of visual tasks and datasets.

Secondly, the computational requirements and resource demands of training large language models with visual tokenization are not extensively discussed in the paper. These factors could pose practical challenges for widespread adoption and deployment of such models, particularly in resource-constrained environments.

Another potential limitation is the lack of explicit consideration for the interpretability and explainability of the visual tokenization process. While the tokenizer may produce effective representations for generation and recognition tasks, it is unclear how interpretable these representations are and whether they can provide insights into the underlying visual features and patterns.

Furthermore, the paper does not delve into the potential biases or ethical considerations that may arise from the use of large language models for visual generation tasks. As these models become more capable of generating realistic and convincing visual content, it is crucial to address issues such as deepfakes, privacy concerns, and the potential for misuse or manipulation.

Conclusion

The introduction of MAGVIT-v2, a powerful visual tokenizer, has demonstrated the potential of large language models to excel at image and video generation tasks, challenging the dominance of diffusion models in this domain. By effectively converting visual data into concise and expressive tokens, MAGVIT-v2 enables language models to leverage their capabilities in understanding and generating visual content.

Beyond generation tasks, the tokenizer has shown promising results in video compression and action recognition, suggesting its versatility and potential for various visual applications. However, it is important to critically evaluate the limitations and potential drawbacks of this approach, such as the computational requirements, interpretability concerns, and ethical considerations surrounding the generation of realistic visual content.

As research in this field continues to advance, it will be crucial to balance pushing the boundaries of visual generation against the societal implications and risks of these powerful technologies. Responsible development and deployment of visual tokenizers and language models will be essential to harness their benefits while mitigating harms.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.

6/4/2024

cs.CV cs.CL

An Image is Worth 32 Tokens for Reconstruction and Generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance to state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, outperforming MaskGIT baseline significantly by 4.21 at ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant when it comes to higher resolution. At ImageNet 512 x 512 benchmark, TiTok not only outperforms state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the image tokens by 64x, leading to 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.

6/12/2024

cs.CV

Towards Multi-Task Multi-Modal Models: A Video Generative Perspective

Lijun Yu

Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions, as well as for understanding and compression applications. Given the high dimensionality of visual data, we pursue concise and accurate latent representations. Our video-native spatial-temporal tokenizers preserve high fidelity. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Furthermore, our scalable visual token representation proves beneficial across generation, compression, and understanding tasks. This achievement marks the first instances of language models surpassing diffusion models in visual synthesis and a video tokenizer outperforming industry-standard codecs. Within these multi-modal latent spaces, we study the design of multi-task generative models. Our masked multi-task transformer excels at the quality, efficiency, and flexibility of video generation. We enable a frozen language model, trained solely on text, to generate visual content. Finally, we build a scalable generative multi-modal transformer trained from scratch, enabling the generation of videos containing high-fidelity motion with the corresponding audio given diverse conditions. Throughout the course, we have shown the effectiveness of integrating multiple tasks, crafting high-fidelity latent representation, and generating multiple modalities. This work suggests intriguing potential for future exploration in generating non-textual data and enabling real-time, interactive experiences across various media forms.

5/28/2024

cs.CV cs.AI cs.LG cs.MM

Distilling Vision-Language Models on Millions of Videos

Yue Zhao, Long Zhao, Xingyi Zhou, Jialin Wu, Chun-Te Chu, Hui Miao, Florian Schroff, Hartwig Adam, Ting Liu, Boqing Gong, Philipp Krahenbuhl, Liangzhe Yuan

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.

4/17/2024

cs.CV