STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

Read original: arXiv:2406.10797 - Published 6/18/2024 by Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huaian Chen, Yi Jin

Related Works

Text-to-image generation is an active area of research, and the works listed here approach it, or neighboring problems, from different angles. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) recasts autoregressive image generation as coarse-to-fine next-scale prediction. Generative Powers of Ten explores the potential of multi-scale generative models. Self-Taught Recognizer: Toward Unsupervised Adaptation in Speech demonstrates the effectiveness of self-supervised learning for speech recognition. TexIM: Fast Text-to-Image Representation Learning presents a technique for efficient text-to-image representation learning. Aggregated Text Transformer for Scene Text Detection explores the use of transformers for scene text detection.

Plain English Explanation

These related works cover text-to-image generation, a task that involves creating images from textual descriptions, along with a few neighboring tasks. The key ideas include:

Generating high-resolution images in a scalable way by using autoregressive models.
Leveraging multi-scale generative models to capture details at different scales.
Applying self-supervised learning techniques to improve speech recognition.
Efficiently learning text-to-image representations to enable faster model training and inference.
Using transformers, a powerful type of neural network, for scene text detection.

These works demonstrate the progress being made in text-to-image generation and related areas, paving the way for more advanced and practical applications.

Technical Explanation

The related works cover several aspects of text-to-image generation and representation learning. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR) generates images with an autoregressive transformer that predicts token maps coarse-to-fine, one scale (resolution) at a time, rather than one token at a time in raster-scan order. Generative Powers of Ten explores multi-scale generative models, which can capture details at different scales to produce more realistic images.
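To make the difference in sequential steps concrete, here is a minimal Python sketch comparing raster-scan next-token prediction with the coarse-to-fine next-scale prediction described in the VAR abstract. The scale schedule `SCALE_SIDES` and the helper names are illustrative assumptions, not values taken from this summary. The sketch only counts steps and tokens; it does not model the transformer itself.

```python
def raster_scan_steps(side):
    """Next-token prediction: one sequential step per token of a side x side map."""
    return side * side


def next_scale_schedule(sides):
    """Next-scale prediction: one sequential step per scale.

    Returns (step, side, tokens_in_step) tuples; all tokens of a scale
    are assumed to be predicted together in a single step.
    """
    return [(step, s, s * s) for step, s in enumerate(sides, start=1)]


# Assumed coarse-to-fine schedule ending at a 16 x 16 token map (illustrative only).
SCALE_SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

schedule = next_scale_schedule(SCALE_SIDES)
num_steps = len(schedule)
total_tokens = sum(tokens for _, _, tokens in schedule)

for step, side, tokens in schedule:
    print(f"step {step:2d}: predict {side:2d} x {side:2d} map ({tokens} tokens)")
print(f"next-scale: {num_steps} sequential steps, {total_tokens} tokens in total")
print(f"raster-scan: {raster_scan_steps(SCALE_SIDES[-1])} sequential steps")
```

Under these assumptions the next-scale model emits more tokens overall, but the number of sequential steps grows with the number of scales rather than with the number of tokens.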

Self-Taught Recognizer: Toward Unsupervised Adaptation in Speech demonstrates the effectiveness of self-supervised learning for speech recognition, where the model learns representations from unlabeled data. TexIM: Fast Text-to-Image Representation Learning presents a method for efficiently learning text-to-image representations, enabling faster model training and inference.

Finally, Aggregated Text Transformer for Scene Text Detection explores the use of transformers, a powerful type of neural network, for scene text detection, which is an important task for various applications.

Critical Analysis

The related works demonstrate the progress being made in text-to-image generation and related areas, but there are still several challenges and limitations to address. For example, the autoregressive models used for high-resolution image generation can be computationally expensive, and the multi-scale generative models may struggle to capture fine-grained details.

Additionally, the self-supervised learning techniques for speech recognition may not generalize well to other domains, and the efficient text-to-image representation learning methods may not be applicable to all types of text-to-image tasks.

Further research is needed to address these limitations and develop more robust and versatile text-to-image generation and representation learning techniques.

Conclusion

The related works highlight the significant progress being made in text-to-image generation and related areas, such as speech recognition and scene text detection. These advancements have the potential to enable a wide range of applications, from intelligent image editing tools to enhanced language understanding systems. However, there are still challenges to overcome, and continued research in this field is crucial for unlocking the full potential of these technologies.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huaian Chen, Yi Jin

We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is limited to class-conditioned synthesis within a fixed set of predetermined categories, our STAR enables text-driven open-set generation through three key designs. First, to boost diversity and generalizability with unseen combinations of objects and concepts, we introduce a pre-trained text encoder to extract representations for textual constraints, which we then use as guidance. Second, to improve the interactions between generated images and fine-grained textual guidance, making results more controllable, additional cross-attention layers are incorporated at each scale. Third, given the natural structure correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version. This ensures consistent interpretation of relative positions across token maps at different scales and stabilizes the training process. Extensive experiments demonstrate that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality. Our findings emphasize the potential of auto-regressive methods in the field of high-quality image synthesis, offering promising new directions for the T2I field currently dominated by diffusion methods.

6/18/2024
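The normalized 2D RoPE mentioned in the STAR abstract can be pictured with a small Python sketch. The abstract does not give the formula, so the normalization used here (rescaling every grid onto a shared coordinate range set by a hypothetical `ref_side`) is an assumption, as are all function names. It is a sketch of the general idea, not the authors' implementation.

```python
import math


def normalized_coords(side, ref_side=16):
    """Map grid indices 0..side-1 onto a shared 0..ref_side range (assumed scheme)."""
    return [i * ref_side / side for i in range(side)]


def rope_rotate(vec, pos, base=10000.0):
    """Standard rotary encoding: rotate consecutive channel pairs by pos-dependent angles."""
    dim = len(vec)
    out = []
    for k in range(0, dim, 2):
        theta = pos * base ** (-k / dim)  # pair index k/2 gives exponent -2(k/2)/dim
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def rope_2d(vec, row, col):
    """2D variant: first half of the channels encodes the row, second half the column.

    len(vec) must be divisible by 4 so that each half splits into pairs.
    """
    half = len(vec) // 2
    return rope_rotate(vec[:half], row) + rope_rotate(vec[half:], col)


# The middle of a 2 x 2 map and the middle of an 8 x 8 map share one coordinate.
coarse = normalized_coords(2)
fine = normalized_coords(8)
print(coarse[1], fine[4])

# Encode the same query vector at those two grid cells.
q = [1.0, 0.5, -0.3, 2.0]
q_coarse = rope_2d(q, coarse[1], coarse[1])
q_fine = rope_2d(q, fine[4], fine[4])
```

With unnormalized integer indices, cell 1 of the coarse map and cell 4 of the fine map would receive different rotations even though they cover the same region of the image. Rescaling to a shared range is one simple way to make relative positions comparable across scales.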

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine next-scale prediction or next-resolution prediction, diverging from the standard raster-scan next-token prediction. This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes GPT-like AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, taking Frechet inception distance (FID) from 18.65 to 1.73 and inception score (IS) from 80.4 to 350.2, with around 20x faster inference speed. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: Scaling Laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.

6/11/2024
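The VAR abstract cites linear correlation coefficients near -0.998 as evidence of power-law scaling. A power law of the form L = a * N^(-b) becomes a straight line once both axes are log-transformed, so such a coefficient is a Pearson correlation between log N and log L. The Python sketch below shows that computation on synthetic numbers. The model sizes, the constants `a` and `b`, and the `wiggle` factors are all made up for illustration and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic model sizes and losses (hypothetical, not taken from the paper).
sizes = [1e7, 3e7, 1e8, 3e8, 1e9, 2e9]
a, b = 20.0, 0.15
wiggle = [1.01, 0.99, 1.005, 0.995, 1.01, 0.99]  # small deterministic deviations
losses = [a * n ** (-b) * w for n, w in zip(sizes, wiggle)]

# A power law is linear in log-log space, so correlate the logs.
r = pearson([math.log(n) for n in sizes], [math.log(loss) for loss in losses])
print(f"log-log Pearson r = {r:.4f}")
```

A coefficient close to -1 means the loss falls almost exactly along a straight line in log-log space as model size grows, which is what a power law predicts.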

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, Ziwei Huang, LeiLei Gan, Hao Jiang

Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis. In this work, we introduce MARS, a novel framework for T2I generation that incorporates a specially designed Semantic Vision-Language Integration Expert (SemVIE). This innovative component integrates pre-trained LLMs by independently processing linguistic and visual information, freezing the textual component while fine-tuning the visual component. This methodology preserves the NLP capabilities of LLMs while imbuing them with exceptional visual understanding. Building upon the powerful base of the pre-trained Qwen-7B, MARS stands out with its bilingual generative capabilities corresponding to both English and Chinese language prompts and the capacity for joint image and text generation. The flexibility of this framework lends itself to migration towards any-to-any task adaptability. Furthermore, MARS employs a multi-stage training strategy that first establishes robust image-text alignment through complementary bidirectional tasks and subsequently concentrates on refining the T2I generation process, significantly augmenting text-image synchrony and the granularity of image details. Notably, MARS requires only 9% of the GPU days needed by SD1.5, yet it achieves remarkable results across a variety of benchmarks, illustrating the training efficiency and the potential for swift deployment in various applications.

7/12/2024

VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren

VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis rather than using textual captions for guidance. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then utilized as textual conditions for image generation. To facilitate training on extensive datasets, such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP2. Furthermore, we delve into the significance of word positioning within CLIP for the purpose of caption guidance. Extensive experiments confirm VAR-CLIP's proficiency in generating fantasy images with high fidelity, textual congruence, and aesthetic excellence. Our project page is https://github.com/daixiangzi/VAR-CLIP

8/6/2024