What Formal Languages Can Transformers Express? A Survey

2311.00208

Published 5/8/2024 by Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin

🔎

Abstract

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.

Get summaries of the top AI research delivered straight to your inbox:

Overview

Researchers have investigated the theoretical capabilities and limitations of transformers, a prominent model in natural language processing.
This work aims to clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices.
The research has made considerable progress in recent years, and this survey aims to harmonize the diverse assumptions and seemingly contradictory findings.

Plain English Explanation

Transformers have become very popular in the field of natural language processing, where they are used to analyze and generate human-like text. Researchers have been exploring what kinds of problems transformers can and cannot solve, by treating them like formal languages with specific rules and structures.

This type of research is helpful for understanding the true power and limitations of transformers, and how they compare to other types of computational models. It can also shed light on how the design and architecture of transformers impact their capabilities.

The research in this area has advanced a lot in recent years, and this survey aims to bring together the various findings and assumptions that have been made, in order to provide a more unified and coherent picture of the field.

Technical Explanation

The research covered in this survey has explored the expressive power of transformers, investigating what types of formal languages they can and cannot represent. Some studies have shown that transformers can represent certain types of language models, while others have looked at their ability to reason about abstract symbols.

Additionally, researchers have examined the relationship between transformer models and the capabilities of mathematicians, exploring the types of problems that transformers can solve. This work has also considered the role of transformers as transducers, and how their architectural choices impact their performance.

Critical Analysis

The research surveyed in this paper provides valuable insights into the theoretical capabilities and limitations of transformers. However, it's important to note that the findings are based on specific assumptions and mathematical frameworks, which may not always align with the practical realities of how transformers are used in real-world applications.

Additionally, the research focuses on the theoretical properties of transformers, but doesn't always address the broader societal and ethical implications of these models, such as their potential for biases or misuse. Further work is needed to bridge the gap between the theoretical and practical aspects of transformer models.

Conclusion

This survey represents an important step in our understanding of the fundamental capabilities and limits of transformer models in natural language processing. By exploring the theoretical foundations of these models, the research provides a clearer picture of their power and potential, as well as the impact of architectural choices.

While the findings are valuable, it's crucial to continue advancing the field by considering a wide range of perspectives and real-world applications. As transformer models become more widely adopted, it will be essential to maintain a critical and holistic view of their capabilities and implications for society.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

📉

The Expressive Power of Transformers with Chain of Thought

William Merrill, Ashish Sabharwal

Recent theoretical work has identified surprisingly simple reasoning problems, such as checking if two nodes in a graph are connected or simulating finite-state machines, that are provably unsolvable by standard transformers that answer immediately after reading their input. However, in practice, transformers' reasoning can be improved by allowing them to use a chain of thought or scratchpad, i.e., generate and condition on a sequence of intermediate tokens before answering. Motivated by this, we ask: Does such intermediate generation fundamentally extend the computational power of a decoder-only transformer? We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. For instance, we find that transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps, assuming projected pre-norm (a slight generalization of standard pre-norm), adds a clear new ability (under standard complexity conjectures): recognizing all regular languages. Our results also imply that linear steps keep transformer decoders within context-sensitive languages, and polynomial steps with generalized pre-norm make them recognize exactly the class of polynomial-time solvable problems -- the first exact characterization of a type of transformers in terms of standard complexity classes. Together, this provides a nuanced framework for understanding how the length of a transformer's chain of thought or scratchpad impacts its reasoning power.

4/15/2024

cs.LG cs.CC cs.CL cs.LO

💬

Transformers Can Represent $n$-gram Language Models

Anej Svete, Ryan Cotterell

Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language emph{acceptance}. We contend that this is an ill-suited problem in the study of emph{language models} (LMs), which are definitionally emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.

4/24/2024

cs.CL cs.AI cs.CC cs.FL cs.LG

🌐

When can transformers reason with abstract symbols?

Enric Boix-Adsera, Omid Saremi, Emmanuel Abbe, Samy Bengio, Etai Littwin, Joshua Susskind

We investigate the capabilities of transformer models on relational reasoning tasks. In these tasks, models are trained on a set of strings encoding abstract relations, and are then tested out-of-distribution on data that contains symbols that did not appear in the training dataset. We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set when trained by gradient descent on sufficiently large quantities of training data. This is in contrast to classical fully-connected networks, which we prove fail to learn to reason. Our results inspire modifications of the transformer architecture that add only two trainable parameters per head, and that we empirically demonstrate improve data efficiency for learning to reason.

4/17/2024

cs.CL cs.AI cs.LG

Large Language Models for Mathematicians

Simon Frieder, Julius Berner, Philipp Petersen, Thomas Lukasiewicz

Large language models (LLMs) such as ChatGPT have received immense interest for their general-purpose language understanding and, in particular, their ability to generate high-quality text or computer code. For many professions, LLMs represent an invaluable tool that can speed up and improve the quality of work. In this note, we discuss to what extent they can aid professional mathematicians. We first provide a mathematical description of the transformer model used in all modern language models. Based on recent studies, we then outline best practices and potential issues and report on the mathematical abilities of language models. Finally, we shed light on the potential of LLMs to change how mathematicians work.

4/3/2024

cs.CL cs.AI cs.LG