TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Read original: arXiv:2405.16847 - Published 5/28/2024 by Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu

Overview

• This paper introduces TokenUnify, a scalable autoregressive visual pre-training approach with a novel "mixture token prediction" technique.

• The key ideas are a unified token representation for efficient pre-training and a mixture token prediction scheme that, according to the paper's abstract, integrates random token prediction, next-token prediction, and next-all token prediction.

• The abstract reports a 45% improvement in segmentation performance on downstream electron microscopy (EM) neuron segmentation tasks compared to existing methods, along with better scalability than masked autoencoder (MAE) and traditional autoregressive pre-training.

Plain English Explanation

TokenUnify is a new technique for pre-training deep learning models to better understand and work with visual data, like images and videos. The main innovations are:

• Unified Token Representation: The model uses a single set of tokens to represent all visual information, which makes the pre-training process more efficient and scalable compared to previous approaches that used different token types.

• Mixture Token Prediction: Instead of training on a single prediction task, the model is trained on a mix of three: guessing randomly hidden tokens, guessing the next token in a sequence, and guessing all of the tokens that come after a given point. Combining the tasks lets the model capture richer visual information than any one task alone.

By using these techniques, the authors report that TokenUnify outperforms existing pre-training methods on neuron segmentation in electron microscopy (EM) images, and that it scales better than masked autoencoder (MAE) and traditional autoregressive pre-training. This suggests TokenUnify is a promising approach for building more capable and versatile computer vision models.

Technical Explanation

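TokenUnify builds on recent work in visual pre-training, such as SEIT and LLaVA, and targets two problems named in its abstract: autoregressive next-token prediction accumulates errors on non-sequential image data, and masked autoencoder (MAE) based pre-training faces scalability issues. The key innovations are: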

• Unified Token Representation: Instead of using separate tokens for different visual elements (e.g. object tokens, texture tokens), TokenUnify uses a single set of learned tokens to represent all visual information. This unified token representation allows for more efficient and scalable pre-training.

• Mixture Token Prediction: During pre-training, the model is trained on three prediction tasks rather than one: random token prediction, next-token prediction, and next-all token prediction. The authors provide theoretical evidence that this combination mitigates the cumulative errors of purely autoregressive visual models.
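The three prediction tasks named in the paper's abstract can be illustrated by the training targets each one would produce for a short token sequence. This is a minimal sketch, not the authors' implementation: the function names, the `MASK` id, the mask ratio, and the exact target layout are all assumptions made for illustration, and the abstract does not say how the three tasks are weighted or scheduled.

```python
import random
from typing import Dict, List, Tuple

MASK = -1  # hypothetical id marking a hidden input position


def next_token_targets(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Pair each prefix with the single token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def next_all_targets(tokens: List[int]) -> List[Tuple[List[int], List[int]]]:
    """Pair each prefix with every token that comes after it."""
    return [(tokens[: i + 1], tokens[i + 1 :]) for i in range(len(tokens) - 1)]


def random_token_targets(
    tokens: List[int], mask_ratio: float = 0.5, seed: int = 0
) -> Tuple[List[int], Dict[int, int]]:
    """Hide a random subset of positions; the targets are the hidden tokens."""
    rng = random.Random(seed)
    n_masked = max(1, int(len(tokens) * mask_ratio))
    masked = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked)}
    return corrupted, targets


tokens = [11, 12, 13, 14]  # stand-in for a flattened sequence of image tokens
print(next_token_targets(tokens))
print(next_all_targets(tokens))
print(random_token_targets(tokens))
```

Next-token prediction gives one target per prefix, next-all gives the whole remaining sequence per prefix, and random token prediction gives targets only at the hidden positions. A training loop would turn each of these into a loss term; how TokenUnify does that is left to the paper.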

The authors evaluate TokenUnify on a large-scale, ultra-high-resolution electron microscopy (EM) image dataset that they assembled, which contains over 120 million annotated voxels. Using the Mamba network for long-sequence modeling, they report a 45% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods, and better scalability than MAE and traditional autoregressive pre-training.

Critical Analysis

The paper presents a well-designed evaluation of the TokenUnify approach, pairing a theoretical argument about cumulative errors with comparisons against MAE and traditional autoregressive pre-training. The authors acknowledge some limitations, such as the computational cost of the mixture token prediction scheme and the potential for overfitting on certain visual patterns.

One area for further research could be exploring ways to reduce the computational overhead of the mixture token prediction, perhaps through more efficient token modeling or prediction techniques. Additionally, the authors could investigate whether TokenUnify generalizes to other settings, such as text recognition, to assess its broader applicability.

Overall, TokenUnify appears to be a promising advance in self-supervised visual pre-training, with the potential to drive further improvements in computer vision capabilities.

Conclusion

TokenUnify introduces a scalable autoregressive visual pre-training approach with a novel "mixture token prediction" technique. By using a unified token representation and a mixture-based prediction scheme, TokenUnify is able to capture diverse visual patterns more effectively than previous pre-training methods. The reported improvements on EM neuron segmentation suggest that TokenUnify is a valuable contribution to the field of computer vision, with the potential to enable the development of more capable and versatile visual AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction

Yinda Chen, Haoyuan Shi, Xiaoyu Liu, Te Shi, Ruobing Zhang, Dong Liu, Zhiwei Xiong, Feng Wu

Autoregressive next-token prediction is a standard pretraining method for large-scale language models, but its application to vision tasks is hindered by the non-sequential nature of image data, leading to cumulative errors. Most vision models employ masked autoencoder (MAE) based pretraining, which faces scalability issues. To address these challenges, we introduce TokenUnify, a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. We provide theoretical evidence demonstrating that TokenUnify mitigates cumulative errors in visual autoregression. Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution, ideal for creating spatially correlated long sequences. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date and providing a unified benchmark for experimental validation. Leveraging the Mamba network inherently suited for long-sequence modeling on this dataset, TokenUnify not only reduces the computational complexity but also leads to a significant 45% improvement in segmentation performance on downstream EM neuron segmentation tasks compared to existing methods. Furthermore, TokenUnify demonstrates superior scalability over MAE and traditional autoregressive methods, effectively bridging the gap between pretraining strategies for language and vision models. Code is available at https://github.com/ydchen0806/TokenUnify.

5/28/2024
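The "cumulative errors" mentioned in the abstract above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each generated token is correct independently with the same probability, which is a simplification introduced here for illustration and not a model taken from the paper; the function name is hypothetical.

```python
def error_free_probability(per_step_accuracy: float, length: int) -> float:
    """Probability that every one of `length` sequential predictions is correct,
    assuming each step is independently correct with `per_step_accuracy`."""
    return per_step_accuracy ** length


# Even a 99%-accurate step predictor rarely produces a long sequence without error.
for n in (10, 100, 1000):
    print(n, round(error_free_probability(0.99, n), 5))
```

Under this assumption, the chance of an error-free sequence falls from roughly 0.90 at 10 tokens to roughly 0.37 at 100 tokens, which is why long image token sequences are a hard case for purely sequential prediction.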

The pitfalls of next-token prediction

Gregor Bachmann, Vaishnavh Nagarajan

Can a mere next-token predictor faithfully model human intelligence? We crystallize this emerging concern and correct popular misconceptions surrounding it, and advocate a simple multi-token objective. As a starting point, we argue that the two often-conflated phases of next-token prediction -- autoregressive inference and teacher-forced training -- must be treated distinctly. The popular criticism that errors can compound during autoregressive inference, crucially assumes that teacher-forcing has learned an accurate next-token predictor. This assumption sidesteps a more deep-rooted problem we expose: in certain classes of tasks, teacher-forcing can simply fail to learn an accurate next-token predictor in the first place. We describe a general mechanism of how teacher-forcing can fail, and design a minimal planning task where both the Transformer and the Mamba architecture empirically fail in that manner -- remarkably, despite the task being straightforward to learn. Finally, we provide preliminary evidence that this failure can be resolved using a simple modification that predicts multiple tokens in advance. We hope this finding can ground future debates and inspire explorations beyond the next-token prediction paradigm. We make our code available under https://github.com/gregorbachmann/Next-Token-Failures

7/9/2024

Auto-Regressive Next-Token Predictors are Universal Learners

Eran Malach

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In this work, we present a theoretical framework for studying auto-regressive next-token predictors. We demonstrate that even simple models such as linear next-token predictors, trained on Chain-of-Thought (CoT) data, can approximate any function efficiently computed by a Turing machine. We introduce a new complexity measure -- length complexity -- which measures the number of intermediate tokens in a CoT sequence required to approximate some target function, and analyze the interplay between length complexity and other notions of complexity. Finally, we show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks. Our results demonstrate that the power of today's LLMs can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.

7/31/2024

Learning with Unmasked Tokens Drives Stronger Vision Learners

Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder reconstructing the masked tokens to the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens only, which may impede the encoder's broader context learning. To tackle the limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage the enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations revealed by achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Code is available at https://github.com/naver-ai/lut

8/27/2024