From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

2405.14838

Published 5/24/2024 by Yuntian Deng, Yejin Choi, Stuart Shieber

🐍

Abstract

When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining high performance. Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication. Furthermore, our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.

Create account to get full access

Overview

This paper investigates how language models can be taught to internalize the step-by-step reasoning process, known as chain-of-thought (CoT) reasoning, that is often essential for achieving high accuracy in complex tasks.
The authors propose a simple yet effective method to help models internalize these CoT steps, which involves gradually removing the intermediate steps from a model trained for explicit CoT reasoning and then fine-tuning it.
This process allows the model to internalize the intermediate reasoning steps, simplifying the reasoning process while maintaining high performance.
The authors demonstrate the effectiveness of their approach on various tasks, including solving 9-by-9 multiplication with up to 99% accuracy and achieving over 50% accuracy on the GSM8K benchmark without producing any intermediate steps.

Plain English Explanation

When using language models to solve complex reasoning tasks, it often helps to have the model explicitly show its step-by-step thought process, or chain-of-thought (CoT). This can lead to more accurate final outputs.

The researchers in this paper wanted to see if they could get language models to internalize this step-by-step reasoning process, so that they don't need to explicitly show the intermediate steps. They developed a simple method to do this: they started with a model that was trained to do CoT reasoning, and then gradually removed the intermediate steps while fine-tuning the model.

This allowed the model to essentially absorb the step-by-step reasoning process, so that it could solve complex tasks like 9-by-9 multiplication with up to 99% accuracy, without needing to show the individual steps. The researchers also found this method worked well for larger language models, like Mistral 7B, helping them solve over 50% of the GSM8K benchmark without producing any intermediate steps.

The key benefit of this approach is that it can simplify the reasoning process for language models, making them more efficient and easier to deploy, while still maintaining high performance on complex tasks. This could be particularly useful for real-world applications where intermediate steps aren't always needed or desired.

Technical Explanation

The paper proposes a method to internalize the chain-of-thought (CoT) reasoning process within language models. The authors start with a model that has been trained to perform explicit CoT reasoning, and then gradually remove the intermediate reasoning steps while fine-tuning the model.

This process allows the model to internalize the step-by-step reasoning process, effectively simplifying the reasoning process while maintaining high performance on the target task. The authors evaluate their approach on two main tasks:

9-by-9 Multiplication: The authors show that their method enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training can only solve up to 4-by-4 multiplication.
GSM8K Benchmark: The authors demonstrate the effectiveness of their method on larger language models, such as Mistral 7B, achieving over 50% accuracy on the GSM8K benchmark without producing any intermediate steps.

The key innovation of this work is the gradual removal of intermediate reasoning steps during fine-tuning, which allows the model to internalize the CoT reasoning process. This simplifies the model's internal reasoning while maintaining high performance, making it potentially more efficient and easier to deploy in real-world applications.

Critical Analysis

The paper presents a promising approach for teaching language models to internalize step-by-step reasoning, but there are a few areas that could be explored further:

Generalization to a wider range of tasks: The authors focused their evaluation on multiplication and the GSM8K benchmark. It would be interesting to see how the method performs on a more diverse set of reasoning tasks, such as multi-step reasoning across languages.
Interpretability of the internalized reasoning: While the authors show that the models can maintain high performance without explicit CoT steps, it's unclear how the internalized reasoning process works. Developing methods to better understand and interpret the models' internal reasoning could provide valuable insights.
Scalability to larger models: The authors demonstrated the effectiveness of their approach on a GPT-2 Small model and Mistral 7B. Exploring the scalability of this method to even larger language models could further expand its potential impact.

Overall, the paper presents a compelling approach for teaching language models to internalize step-by-step reasoning, which could have important implications for the development of more efficient and capable AI systems.

Conclusion

This paper introduces a simple yet effective method for teaching language models to internalize the step-by-step reasoning process, known as chain-of-thought (CoT) reasoning. The authors show that their approach allows language models to solve complex tasks, like 9-by-9 multiplication and the GSM8K benchmark, with high accuracy and without the need to explicitly generate intermediate reasoning steps.

This internalization of the CoT process could lead to more efficient and deployable language models, as the simplified reasoning process may be easier to understand and optimize. While the paper focuses on specific tasks, the authors suggest that their method could be applicable to a wider range of multi-step reasoning problems, potentially expanding the capabilities of language models in real-world applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models

Huiyuan Lai, Malvina Nissim

Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve various downstream tasks. As most research mainly focuses on English, with few explorations in a multilingual context, the question of how reliable this reasoning capability is in different languages is still open. To address it directly, we study multilingual reasoning consistency across multiple languages, using popular open-source LLMs. First, we compile the first large-scale multilingual math reasoning dataset, mCoT-MATH, covering eleven diverse languages. Then, we introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency. While existing LLMs show substantial variation across the languages we consider, and especially low performance for lesser resourced languages, our 7B parameter model mCoT achieves impressive consistency across languages, and superior or comparable performance to close- and open-source models even of much larger sizes.

6/5/2024

cs.CL

🤔

How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning

Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty

Despite superior reasoning prowess demonstrated by Large Language Models (LLMs) with Chain-of-Thought (CoT) prompting, a lack of understanding prevails around the internal mechanisms of the models that facilitate CoT generation. This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. We observe a functional rift in the middle layers of the LLM. Token representations in the initial half remain strongly biased towards the pretraining prior, with the in-context prior taking over in the later half. This internal phase shift manifests in different functional components: attention heads that write the answer token appear in the later half, attention heads that move information along ontological relationships appear in the initial half, and so on. To the best of our knowledge, this is the first attempt towards mechanistic investigation of CoT reasoning in LLMs.

5/7/2024

cs.CL cs.LG

💬

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.

5/21/2024

cs.CL cs.AI cs.CV

Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu

As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion ($approx 4.7%$) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistaktextbf{E}-textbf{D}riven key reasontextbf{I}ng step distillatextbf{T}ion (textbf{EDIT}), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTsfootnote{Code can be found at url{https://github.com/C-W-D/EDIT}}.

5/31/2024

cs.CL cs.AI