Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Read original: arXiv:2402.12138 - Published 5/28/2024 by Markus Hiller, Krista A. Ehinger, Tom Drummond

Overview

This paper presents a novel bi-directional Transformer architecture called BiXT that is more efficient than traditional Transformers.
BiXT scales linearly with input size in terms of computational cost and memory consumption, but does not suffer the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches.
BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module.

Plain English Explanation

The paper introduces a new type of Transformer model called BiXT that is more efficient than standard Transformers. Regular Transformers can struggle when processing very long sequences of data, like high-resolution images or point clouds, because they become computationally expensive and memory-intensive.

BiXT solves this problem by using a special type of attention mechanism that allows the model to efficiently process long inputs. Instead of the iterative attention used by Perceiver-style models, BiXT has the input tokens and the latent variables (a small set of internal representations) attend to each other simultaneously. This exploits a natural "symmetry" in the attention between the two and removes a key bottleneck experienced by similar efficient Transformer approaches like Perceiver.

By combining this efficient attention mechanism with the power of a full Transformer architecture, BiXT can process longer sequences, like images and point clouds, at higher resolutions while still achieving competitive performance. This makes it useful for a wide range of tasks, from image classification to document retrieval.

Technical Explanation

The key innovation in BiXT is the use of a bi-directional cross-attention module, which replaces the iterative attention mechanism used in Perceiver-like architectures. In this module, the input tokens and latent variables attend to each other simultaneously, leveraging an attention symmetry that emerges naturally between the two. This allows the model to process long sequences efficiently without the performance drops seen in other efficient Transformers.
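To make the idea concrete, here is a minimal single-head sketch of one bi-directional cross-attention step in NumPy. It is not the authors' implementation. It assumes that the symmetry is exploited by computing one token-latent similarity matrix and normalizing it in both directions, and it omits multi-head splitting, normalization layers, and output projections. All names (`bidirectional_cross_attention`, the `w_*` weight matrices, the sizes `N`, `M`, `D`) are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_cross_attention(tokens, latents,
                                  w_tok_ref, w_tok_val,
                                  w_lat_ref, w_lat_val):
    """One simultaneous token <-> latent attention step (illustrative only).

    tokens:  (N, D) input tokens
    latents: (M, D) latent vectors, with M much smaller than N
    w_*:     (D, D) projection matrices (hypothetical names)
    """
    r_tok = tokens @ w_tok_ref    # (N, D) token reference vectors
    v_tok = tokens @ w_tok_val    # (N, D) token values
    r_lat = latents @ w_lat_ref   # (M, D) latent reference vectors
    v_lat = latents @ w_lat_val   # (M, D) latent values

    # Assumption: a single similarity matrix is shared by both directions.
    sim = r_lat @ r_tok.T / np.sqrt(r_tok.shape[-1])   # (M, N)

    attn_lat = softmax(sim, axis=-1)      # each latent attends over all tokens
    attn_tok = softmax(sim.T, axis=-1)    # each token attends over all latents

    new_latents = latents + attn_lat @ v_tok   # residual update of latents
    new_tokens = tokens + attn_tok @ v_lat     # residual update of tokens
    return new_tokens, new_latents


rng = np.random.default_rng(0)
N, M, D = 1024, 64, 32   # illustrative sizes, not taken from the paper
tokens = rng.normal(size=(N, D))
latents = rng.normal(size=(M, D))
weights = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4)]

new_tokens, new_latents = bidirectional_cross_attention(tokens, latents, *weights)
print(new_tokens.shape, new_latents.shape)
```

In this sketch the only matrix whose size depends on the input length is `sim`, with shape `(M, N)`, so both the token update and the latent update grow linearly in the number of tokens for a fixed number of latents.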

The authors show that BiXT scales linearly with input size in terms of computational cost and memory consumption. In their experiments, BiXT models outperform larger competitors on vision tasks such as image classification and segmentation by using longer sequences more efficiently. On sequence modeling and document retrieval, BiXT performs on par with full Transformer variants while requiring 28% fewer FLOPs and running up to 8.4x faster.
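The difference between quadratic and linear scaling can be illustrated with a rough count of attention scores. The sketch below counts only the pairwise similarity scores per head and layer; it ignores projections, feed-forward layers, and any attention among the latents themselves. The function name `attention_score_entries` and the choice of 64 latents are assumptions made for illustration.

```python
def attention_score_entries(num_tokens, num_latents=None):
    """Pairwise similarity scores computed per head and layer.

    With num_latents=None this models full self-attention over the tokens;
    otherwise it models attention between tokens and a fixed set of latents.
    """
    if num_latents is None:
        return num_tokens * num_tokens     # grows quadratically with input size
    return num_tokens * num_latents        # grows linearly with input size


for n in (1_024, 4_096, 16_384):
    full = attention_score_entries(n)
    cross = attention_score_entries(n, num_latents=64)
    print(f"N={n:>6}: self-attention {full:>12,} scores, "
          f"token-latent {cross:>10,} scores")
```

Quadrupling the sequence length multiplies the self-attention count by 16 but the token-latent count by only 4, which is why a fixed latent set lets the model afford longer sequences.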

The authors attribute BiXT's strong performance to its ability to jointly develop representations of both the "what" (semantics) and "where" (location) information in the input, as the input tokens and latent variables attend to each other over multiple layers. This allows BiXT to be applied effectively to both dense and instance-based tasks.

Critical Analysis

The paper provides a thorough evaluation of BiXT across a wide range of tasks, demonstrating its strong performance and efficiency compared to larger Transformer models. However, the authors do not discuss any major limitations or caveats of their approach.

One potential concern is the complexity of the bi-directional cross-attention module, which could make the model more difficult to train or interpret compared to simpler attention mechanisms. The authors also do not explore the effects of scaling BiXT to even larger input sizes or deeper model depths.

Additionally, while the paper highlights BiXT's ability to process long sequences, it would be valuable to see how the model performs on truly massive inputs, such as high-resolution video or full-size documents, and how its efficiency and accuracy scale in those regimes.

Conclusion

The BiXT model presented in this paper represents an important advancement in efficient Transformer architectures. By introducing a novel bi-directional cross-attention mechanism, the authors have developed a model that can process long sequences, like images and point clouds, more efficiently than traditional Transformers without sacrificing performance.

The strong results across a diverse set of tasks suggest that BiXT could have wide-ranging applications in computer vision, natural language processing, and beyond. As the field continues to grapple with the computational demands of increasingly large and complex inputs, innovations like BiXT will be crucial for enabling the deployment of powerful AI models in real-world scenarios.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Markus Hiller, Krista A. Ehinger, Tom Drummond

We present a novel bi-directional Transformer architecture (BiXT) which scales linearly with input size in terms of computational cost and memory consumption, but does not suffer the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention-symmetry between the two. This approach unlocks a key bottleneck experienced by Perceiver-like architectures and enables the processing and interpretation of both semantics ('what') and location ('where') to develop alongside each other over multiple layers -- allowing its direct application to dense and instance-based tasks alike. By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences like point clouds, text or images at higher feature resolutions and achieves competitive performance across a range of tasks like point cloud part segmentation, semantic image segmentation, image classification, hierarchical sequence modeling and document retrieval. Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval -- but require 28% fewer FLOPs and are up to 8.4x faster.

5/28/2024

Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier

Aristeidis Tsaris, Chengming Zhang, Xiao Wang, Junqi Yin, Siyan Liu, Moetasim Ashfaq, Ming Fan, Jong Youl Choi, Mohamed Wahib, Dan Lu, Prasanna Balaprakash, Feiyi Wang

Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications, due to their capability to process large sequence lengths. While transformers for text have inspired scaling sequence lengths in ViTs, adapting these for ViTs introduces unique challenges. We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens. Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training, achieving a 94% batch scaling efficiency on 2,048 AMD-MI250X GPUs. Evaluating sequence parallelism in ViTs, particularly in models up to 10B parameters, highlighted substantial bottlenecks. We countered these with hybrid sequence, pipeline, tensor parallelism, and flash attention strategies, to scale beyond single GPU memory limits. Our method significantly enhances climate modeling accuracy by 20% in temperature predictions, marking the first training of a transformer model on a full-attention matrix over 188K sequence length.

5/28/2024

You Only Need Less Attention at Each Stage in Vision Transformers

Shuoxi Zhang, Hanpeng Liu, Stephen Lin, Kun He

The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules, which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies, the computational complexity grows quadratically with the number of tokens, which is a major hindrance to the practical application of ViTs. Moreover, the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly, we argue against the necessity of computing the attention scores in every layer, and we propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation, merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover, our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.

6/4/2024

Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

Théodor Lemerle, Nicolas Obin, Axel Roebel

Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-constrained hardware. Moreover they lack specific inductive bias with regards to the monotonic nature of TTS alignments. In response, we propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms for reducing repeating and skipping issues. Consequently our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size. Our implementation and demos are available at https://github.com/theodorblackbird/lina-speech.

6/12/2024