An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels


This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.

  • This paper explores the use of Transformers, a type of deep learning model, for processing images at the individual pixel level rather than in larger 16x16 patches.
  • The researchers investigate whether this fine-grained approach can outperform the more common patch-based methods used in Transformer-based image models.
  • They evaluate the model's performance on tasks like image classification, reconstruction, and generation.

Plain English Explanation

In this paper, the researchers look at a different way of using Transformers, a type of AI model, to work with images. Typically, Transformer-based image models break the image into 16x16 pixel patches and process each patch as a single unit. The researchers asked whether processing the image one pixel at a time could work just as well, or even better.

So they built a model that looks at each individual pixel in an image, rather than larger chunks of pixels. They tested this model on tasks like identifying what's in an image, reconstructing an image from a partial or corrupted version, and generating new images.

The key idea is that by focusing on the smallest building blocks of the image (the individual pixels), the model might be able to capture more subtle details and patterns that get lost when you group pixels into larger patches. This could lead to better performance on various computer vision tasks.

The paper explores the pros and cons of this pixel-level approach compared to the more common patch-based methods used in Transformer-based image models. It provides insights into when and why the pixel-level approach might be advantageous.

Technical Explanation

The paper does not propose a new architecture. It applies a vanilla Transformer directly to individual pixels, treating each pixel as a token, instead of following the Vision Transformer convention of treating each 16x16 pixel patch as a token.
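To make the difference between the two tokenizations concrete, here is a minimal NumPy sketch. The function names `pixels_as_tokens` and `patches_as_tokens` and the 32x32 image size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pixels_as_tokens(image):
    """Flatten an (H, W, C) image into H*W tokens, one per pixel."""
    h, w, c = image.shape
    return image.reshape(h * w, c)

def patches_as_tokens(image, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch tokens."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

# Hypothetical 32x32 RGB image filled with random values.
image = np.random.default_rng(0).random((32, 32, 3))
pixel_tokens = pixels_as_tokens(image)   # shape (1024, 3): many tiny tokens
patch_tokens = patches_as_tokens(image)  # shape (4, 768): few large tokens
```

Each patch token bundles 256 neighboring pixels, which is where the locality bias enters. Each pixel token carries only its own color values.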

The underlying question is whether locality, the inductive bias that ConvNets build in and that patch-based tokenization retains, is actually necessary. To test this, the researchers evaluate pixels-as-tokens on three well-studied tasks: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models.

The model architecture consists of a Transformer encoder that processes each pixel individually, followed by task-specific heads for the different computer vision applications. The authors compare this pixel-level Transformer to more traditional patch-based Transformer models, as well as convolutional neural networks.
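The core of such an encoder is a self-attention layer in which every pixel token can attend to every other pixel token. The sketch below shows one single-head layer in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the weights are random stand-ins for learned parameters, and the function name `pixel_self_attention`, the width `d_model=16`, and the use of a per-position embedding table are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_self_attention(image, d_model=16, seed=0):
    """One single-head self-attention layer over per-pixel tokens."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    n = h * w
    tokens = image.reshape(n, c)                 # one token per pixel
    w_embed = rng.normal(0, 0.02, (c, d_model))  # linear projection (random stand-in)
    pos = rng.normal(0, 0.02, (n, d_model))      # position table (assumed, random stand-in)
    x = tokens @ w_embed + pos
    w_q, w_k, w_v = (rng.normal(0, d_model ** -0.5, (d_model, d_model)) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(d_model))   # (n, n): every pixel attends to every pixel
    return attn @ v, attn

# Hypothetical 8x8 RGB image: 64 tokens, so the attention matrix is 64x64.
image = np.random.default_rng(1).random((8, 8, 3))
out, attn = pixel_self_attention(image)
```

Nothing in this layer tells the model which pixels are neighbors. Any spatial structure would have to be learned from data through the position table and the attention weights.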

The experimental results show that in some cases, the pixel-level Transformer can outperform patch-based approaches, particularly on tasks that require fine-grained understanding of image content. However, the researchers also note that the computational cost of the pixel-level model is higher, which may limit its practical applicability.
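The cost gap comes from self-attention comparing every token with every other token, so the work grows with the square of the sequence length. A small sketch of the arithmetic follows. The 224x224 resolution is an assumed example, not a figure from the paper, and `attention_cost` is a hypothetical helper.

```python
def attention_cost(height, width, patch=1):
    """Return (number of tokens, number of token pairs scored by attention)."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens

# Assumed 224x224 input, tokenized per pixel and per 16x16 patch.
pixel_tokens, pixel_pairs = attention_cost(224, 224, patch=1)
patch_tokens, patch_pairs = attention_cost(224, 224, patch=16)

print(pixel_tokens, patch_tokens)  # 50176 196
print(pixel_pairs // patch_pairs)  # 65536
```

At this assumed resolution, per-pixel tokenization gives 256 times more tokens and 65,536 times more attention pairs per layer than 16x16 patches.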

Critical Analysis

The paper provides an interesting exploration of an alternative way to use Transformers for computer vision. Treating individual pixels as tokens is less a new method than a test of a long-standing assumption, namely that locality must be built into the architecture. The results suggest that this assumption can be relaxed, which could lead to performance improvements in certain scenarios.

However, the increased computational cost of the pixel-level model is a significant limitation that the authors acknowledge. In many practical applications, the computational efficiency of the model is just as important as its accuracy. The authors do not provide a clear sense of the trade-offs between the performance gains and the increased computational requirements.

Additionally, the paper does not delve deeply into the specific mechanisms by which the pixel-level approach confers advantages over patch-based methods. More analysis and explanation of the underlying reasons for the performance differences would help readers better understand the strengths and weaknesses of the proposed approach.

Further research could explore ways to mitigate the computational challenges of the pixel-level Transformer, such as through model optimization or the use of specialized hardware. Investigating the types of computer vision tasks where the pixel-level approach is most beneficial would also be a valuable direction for future work.


Conclusion

This paper shows that a vanilla Transformer can operate at the individual pixel level, rather than on the larger image patches used by most Transformer-based image models, and still achieve strong results. The researchers find that in some cases this fine-grained, pixel-level processing can outperform patch-based Transformer models, particularly on tasks that require a detailed understanding of image content.

However, the increased computational cost of the pixel-level model is a significant limitation that may restrict its practical applicability. Further research is needed to better understand the specific mechanisms behind the performance gains of the pixel-level approach and to explore ways to mitigate the computational challenges.

Overall, this paper offers a useful perspective on Transformers for computer vision: locality may be less essential than current architectures assume. Even though operating on individual pixels is less computationally practical today, the finding is worth keeping in mind when designing the next generation of vision architectures.

