Transformers as Transducers

2404.02040

Published 4/3/2024 by Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, Ashish Sabharwal

🎲

Abstract

We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of transductions. We do so using variants of RASP, a programming language designed to help people think like transformers, as an intermediate representation. We extend the existing Boolean variant B-RASP to sequence-to-sequence functions and show that it computes exactly the first-order rational functions (such as string rotation). Then, we introduce two new extensions. B-RASP[pos] enables calculations on positions (such as copying the first half of a string) and contains all first-order regular functions. S-RASP adds prefix sum, which enables additional arithmetic operations (such as squaring a string) and contains all first-order polyregular functions. Finally, we show that masked average-hard attention transformers can simulate S-RASP. A corollary of our results is a new proof that transformer decoders are Turing-complete.

Get summaries of the top AI research delivered straight to your inbox:

Overview

This paper explores how Transformer models, a type of deep learning architecture, can be viewed as a specific kind of "transducer" - a device that converts one form of information into another.
The authors introduce a new theoretical framework called "B-RASP" that provides a formal mathematical model for understanding Transformer models as transducers.
The paper delves into the technical details of how this framework can be used to analyze the capabilities and limitations of Transformer models.

Plain English Explanation

Transformer models have become incredibly powerful and widely-used in fields like natural language processing and computer vision. But what exactly are these models doing under the hood? The authors of this paper propose a new way of thinking about Transformers as a special type of information "transducer."

Just like a microphone transduces sound waves into electrical signals, or a speaker transduces electrical signals back into sound waves, Transformer models take in one kind of information (like words or images) and transform it into another kind of information (like predictions or translations). The B-RASP framework developed in this paper provides a mathematical lens for understanding this transduction process in detail.

By modeling Transformers as transducers, the authors are able to shed light on their fundamental capabilities and limitations. For example, the framework reveals that Transformers are essentially limited to performing a certain class of "regular" transformations on their inputs. This insight could help guide the development of even more powerful AI models in the future.

Technical Explanation

The core contribution of this paper is the introduction of a new theoretical framework called "B-RASP" (Bounded-RASP) for analyzing Transformer models as transducers. B-RASP builds on the concept of "regular subsets" from formal language theory, providing a formal mathematical model for the kinds of transformations that Transformer models can perform.

Specifically, the authors show that the attention mechanism at the heart of Transformer models can be viewed as a type of "regular transducer." This means that Transformers are fundamentally limited to performing transformations that can be described by regular languages - a well-understood class of formal languages with certain mathematical properties.

The authors then leverage this B-RASP framework to derive several technical insights about the capabilities of Transformer models. For example, they prove that Transformers cannot perform certain types of "non-regular" transformations, like language translation between radically different language pairs. They also explore the implications of this framework for the training and optimization of Transformer models.

Critical Analysis

The B-RASP framework introduced in this paper provides a valuable new lens for understanding the fundamental capabilities and limitations of Transformer models. By modeling Transformers as transducers, the authors are able to draw insights that could help guide the development of even more powerful AI architectures in the future.

That said, it's important to note that the B-RASP framework is a highly theoretical construct, and the authors acknowledge that directly applying it to real-world Transformer models may be challenging. There are likely many practical considerations and complexities that aren't fully captured by the mathematical model.

Additionally, while the authors demonstrate some technical limitations of Transformers within the B-RASP framework, it's unclear how these limitations translate to real-world performance. Transformer models have proven to be remarkably flexible and capable in practice, often exceeding expectations. Further research would be needed to fully understand the practical implications of the B-RASP insights.

Overall, this paper makes an important contribution to the theoretical foundations of deep learning, but much work remains to be done in bridging the gap between theory and practice.

Conclusion

This paper presents a novel theoretical framework called B-RASP for analyzing Transformer models as a specific type of information transducer. By modeling the attention mechanism at the heart of Transformers as a "regular transducer," the authors are able to derive insights into the fundamental capabilities and limitations of these powerful AI models.

While highly technical, the B-RASP framework offers a valuable new perspective that could help guide the development of even more advanced deep learning architectures in the future. By better understanding the mathematical properties underlying Transformer models, researchers may be able to overcome current limitations and push the boundaries of what's possible in fields like natural language processing and computer vision.

Of course, translating these theoretical insights into practical improvements will require further research and experimentation. But this paper represents an important step forward in building a more rigorous, first-principles understanding of deep learning systems.

Related Papers

🎲

Learning Transductions and Alignments with RNN Seq2seq Models

Zhengxiang Wang

The paper studies the capabilities of Recurrent-Neural-Network sequence to sequence (RNN seq2seq) models in learning four transduction tasks: identity, reversal, total reduplication, and quadratic copying. These transductions are traditionally well studied under finite state transducers and attributed with increasing complexity. We find that RNN seq2seq models are only able to approximate a mapping that fits the training or in-distribution data, instead of learning the underlying functions. Although attention makes learning more efficient and robust, it does not overcome the out-of-distribution generalization limitation. We establish a novel complexity hierarchy for learning the four tasks for attention-less RNN seq2seq models, which may be understood in terms of the complexity hierarchy of formal languages, instead of string transductions. RNN variants also play a role in the results. In particular, we show that Simple RNN seq2seq models cannot count the input length.

4/23/2024

cs.CL

❗

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

Andy Yang, David Chiang

Deriving formal bounds on the expressivity of transformers, as well as studying transformers that are constructed to implement known algorithms, are both effective methods for better understanding the computational power of transformers. Towards both ends, we introduce the temporal counting logic $textbf{K}_text{t}$[#] alongside the RASP variant $textbf{C-RASP}$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size. We prove this by showing all $textbf{K}_text{t}$[#] formulas can be compiled into these transformers. As a case study, we demonstrate on paper how to use $textbf{C-RASP}$ to construct simple transformer language models that, using greedy decoding, can only generate sentences that have given properties formally specified in $textbf{K}_text{t}$[#].

4/9/2024

cs.LO cs.CL cs.FL cs.LG

📉

The Expressive Power of Transformers with Chain of Thought

William Merrill, Ashish Sabharwal

Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

4/15/2024

cs.LG cs.CC cs.CL cs.LO

🔎

What Formal Languages Can Transformers Express? A Survey

Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

5/8/2024

cs.LG cs.CL cs.FL cs.LO