The Expressive Power of Transformers with Chain of Thought

2310.07923

Published 4/15/2024 by William Merrill, Ashish Sabharwal

📉

Abstract

Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

Create account to get full access

Overview

Researchers have found that standard transformer models, which provide immediate outputs, are limited in their ability to solve certain simple reasoning problems.
However, transformers can improve their reasoning by generating and conditioning on a sequence of intermediate tokens before answering, known as a "chain of thought" or "scratchpad."
This paper explores whether this intermediate generation fundamentally extends the computational power of a decoder-only transformer.

Plain English Explanation

Transformers are a type of artificial intelligence model that have become widely used for tasks like language processing and generation. These models typically provide an immediate output after reading their input.

However, recent research has shown that there are some surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that standard transformers cannot solve effectively.

To address this limitation, the researchers in this paper explored whether allowing transformers to use a "chain of thought" or "scratchpad" could fundamentally increase their computational power. This means the transformer would generate and condition on a sequence of intermediate tokens before providing a final answer.

The key finding is that the answer is yes, but the amount of increase in computational power depends heavily on the length of the intermediate generation. For example, a transformer with a logarithmic number of decoding steps (relative to the input length) only slightly improves on standard transformers, while a linear number of decoding steps can allow the transformer to recognize all regular languages, a clear new ability.

The researchers also show that linear decoding steps keep transformers within the class of context-sensitive languages, while polynomial steps with a certain generalization can make them recognize exactly the class of problems solvable in polynomial time. This provides a nuanced framework for understanding how the length of a transformer's "chain of thought" impacts its reasoning power.

Technical Explanation

This paper examines whether allowing transformer decoders to generate and condition on a sequence of intermediate tokens, rather than providing a single immediate output, can fundamentally extend their computational power.

The researchers first establish that standard transformer decoders are provably limited in their ability to solve certain simple reasoning problems, such as checking graph connectivity and simulating finite-state machines. This is due to their inability to maintain and update an internal state or "scratchpad" as they process the input.

To overcome this limitation, the authors consider transformer decoders that are allowed to generate and condition on a sequence of intermediate tokens before providing a final output. This mimics the "chain of thought" or "scratchpad" that humans often use when solving complex reasoning problems.

The key results are:

A logarithmic number of decoding steps (relative to input length) only slightly extends the power of standard transformers.
A linear number of decoding steps, with a slight generalization of standard "pre-norm" layers, allows transformers to recognize all regular languages, a clear new ability.
Linear decoding steps keep transformers within the class of context-sensitive languages.
Polynomial decoding steps with a further generalization make transformers recognize exactly the class of problems solvable in polynomial time.

These findings provide a nuanced framework for understanding how the length of a transformer's "chain of thought" impacts its reasoning capabilities, ranging from slight improvements to the ability to solve more complex problems.

Critical Analysis

The paper provides a rigorous theoretical analysis of how the ability to generate and condition on intermediate tokens can extend the computational power of transformer decoders. The insights offered are valuable for understanding the fundamental limitations and capabilities of these widely used models.

One potential limitation of the research is that it focuses solely on the theoretical computational power of transformers, without considering practical aspects such as training dynamics, sample efficiency, and real-world performance. While the theoretical results are important, it would be valuable to see how these insights translate to actual transformer-based systems and their performance on relevant tasks.

Additionally, the paper does not explore the implications of these findings for the development of more capable reasoning systems. Further research could investigate how the insights from this work could be leveraged to design transformer architectures or training approaches that better support robust reasoning and problem-solving abilities.

Overall, this paper offers a significant contribution to the understanding of transformer models and their computational limitations and capabilities. The findings provide a solid foundation for future research in this area.

Conclusion

This paper investigates whether allowing transformer decoders to generate and condition on a sequence of intermediate tokens, rather than providing a single immediate output, can fundamentally extend their computational power. The key finding is that it can, but the extent of the increase depends heavily on the length of the intermediate generation process.

The researchers show that a logarithmic number of decoding steps only slightly improves on standard transformers, while a linear number of steps can enable transformers to recognize all regular languages. They also characterize the computational classes that transformers with different step counts can represent, providing a nuanced framework for understanding the impact of "chain of thought" or "scratchpad" generation on reasoning abilities.

These insights contribute to a deeper understanding of the fundamental limitations and capabilities of transformer models, which are widely used in various artificial intelligence applications. The work also suggests promising directions for future research on developing more capable reasoning systems, by leveraging the potential of intermediate generation to push the boundaries of what transformers can achieve.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Chain of Thought Empowers Transformers to Solve Inherently Serial Problems

Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision $mathsf{poly}(n)$ embedding size can only solve problems in $mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in $mathsf{AC}^0$, a proper subset of $ mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(log n)$ embedding size can solve any problem solvable by boolean circuits of size $T$. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

5/24/2024

cs.LG cs.CC stat.ML

How Far Can Transformers Reason? The Locality Barrier and Inductive Scratchpad

Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'distribution locality' to capture when weak learning is efficiently achievable by regular Transformers, where the locality measures the least number of tokens required in addition to the tokens histogram to correlate nontrivially with the target. As shown experimentally and theoretically under additional assumptions, distributions with high locality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Furthermore, we show that (i) an agnostic scratchpad cannot help to break the locality barrier, (ii) an educated scratchpad can help if it breaks the locality at each step, (iii) a notion of 'inductive scratchpad' can both break the locality and improve the out-of-distribution generalization, e.g., generalizing to almost double input size for some arithmetic tasks.

6/11/2024

cs.LG cs.AI stat.ML

🤔

Understanding Transformer Reasoning Capabilities via Graph Algorithms

Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, Vahab Mirrokni

Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their algorithmic reasoning capabilities in realistic parameter regimes is lacking. We investigate this question in terms of the network's depth, width, and number of extra tokens for algorithm execution. Our novel representational hierarchy separates 9 algorithmic reasoning problems into classes solvable by transformers in different realistic parameter scaling regimes. We prove that logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer transformers with small embedding dimensions can solve contextual retrieval tasks. We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.

5/30/2024

cs.LG cs.AI

🧠

On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning

Franz Nowak, Anej Svete, Alexandra Butoi, Ryan Cotterell

The performance of modern language models (LMs) has been improved by chain-of-thought (CoT) reasoning, i.e., the process of generating intermediate results that guide the model towards a final answer. A possible explanation for this improvement is that CoT reasoning extends an LM's computational power, as RNNs and transformers with additional scratch space are known to be Turing complete. Comparing LMs to Turing machines, however, introduces a category error - Turing machines decide language membership, whereas LMs define distributions over strings. To bridge this gap, we formalize CoT reasoning in a probabilistic setting. We present several results on the representational capacity of recurrent and transformer LMs with CoT reasoning, showing that they can represent the same family of distributions over strings as probabilistic Turing machines.

6/21/2024

cs.CL cs.FL