Symbolic Chain-of-Thought Distillation: Small Models Can Also Think Step-by-Step

2306.14050

Published 4/17/2024 by Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi

Abstract

Chain-of-thought prompting (e.g., Let's think step-by-step) primes large language models to verbalize rationalization for their predictions. While chain-of-thought can lead to dramatic performance gains, benefits appear to emerge only for sufficiently large models (beyond 50B parameters). We show that orders-of-magnitude smaller models (125M -- 1.3B parameters) can still benefit from chain-of-thought prompting. To achieve this, we introduce Symbolic Chain-of-Thought Distillation (SCoTD), a method to train a smaller student model on rationalizations sampled from a significantly larger teacher model. Experiments across several commonsense benchmarks show that: 1) SCoTD enhances the performance of the student model in both supervised and few-shot settings, and especially for challenge sets; 2) sampling many reasoning chains per instance from the teacher is paramount; and 3) after distillation, student chain-of-thoughts are judged by humans as comparable to the teacher, despite orders of magnitude fewer parameters. We test several hypotheses regarding what properties of chain-of-thought samples are important, e.g., diversity vs. teacher likelihood vs. open-endedness. We release our corpus of chain-of-thought samples and code.

Overview

This paper introduces Symbolic Chain-of-Thought Distillation (SCoTD), a method that lets much smaller language models (125M to 1.3B parameters) benefit from step-by-step reasoning.
The method trains a small student model on rationalizations sampled from a significantly larger teacher model.
The researchers report that distillation improves the student in both supervised and few-shot settings, and that human raters judge the student's chains of thought as comparable to the teacher's.

Plain English Explanation

In this paper, the researchers present a technique called "Symbolic Chain-of-Thought Distillation" that lets smaller language models reason in a step-by-step manner, a benefit that previously appeared to emerge only in much larger models. This builds on previous work on distilling knowledge from large models into smaller ones, as seen in papers like "Can Small Language Models Help Large Language Models" and "Demystifying Chains, Trees, and Graphs of Thoughts".

The key idea is to train a smaller model to mimic the step-by-step reasoning process of a larger, more capable model. This is done by having the larger model 'show its work' as it solves various reasoning problems, and then using that information to train the smaller model to follow a similar line of thinking.

The researchers demonstrate that this distillation process improves the smaller model over the same model without distillation, and that human raters judge its chains of thought as comparable to those of the larger model. This is an exciting development, as it means that the benefits of step-by-step reasoning can be enjoyed even with more limited computational resources.

This work relates to other recent efforts to extract and transfer cognitive capabilities from large to small models, such as "Minds Mirror: Distilling Self-Evaluation Capability" and "Post-Semantic Thinking: A Robust Strategy to Distill Reasoning Capabilities".

Technical Explanation

The key innovation of this paper is the "Symbolic Chain-of-Thought Distillation" technique, which allows smaller language models to mimic the step-by-step reasoning process of larger models.

The researchers start from a significantly larger "teacher" model that is prompted with chain-of-thought, so that it verbalizes a rationalization for its prediction rather than only a final answer. The teacher itself is not trained; instead, rationalizations are sampled from it for each instance. These sampled chains of thought form the corpus that the smaller "student" model (125M to 1.3B parameters) learns to imitate.
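The sampling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the `Teacher` callable, the prompt layout ("Q: ... A: ... So the answer is ..."), and the toy teacher are all hypothetical stand-ins for a real large model queried with sampling enabled.

```python
import random
from typing import Callable, Dict, List

# Hypothetical teacher interface: takes a prompt, returns one sampled completion.
Teacher = Callable[[str], str]


def build_prompt(question: str, demos: List[Dict[str, str]]) -> str:
    """Few-shot chain-of-thought prompt: worked demos, then the new question."""
    parts = [
        f"Q: {d['question']}\nA: {d['rationale']} So the answer is {d['answer']}."
        for d in demos
    ]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def sample_chains(
    question: str, demos: List[Dict[str, str]], teacher: Teacher, n_samples: int
) -> List[str]:
    """Query the teacher n_samples times with the same prompt."""
    prompt = build_prompt(question, demos)
    return [teacher(prompt).strip() for _ in range(n_samples)]


def make_toy_teacher(seed: int = 0) -> Teacher:
    """Stand-in for a real model: returns a random canned rationale."""
    rng = random.Random(seed)
    canned = [
        " Milk spoils when warm. A fridge keeps food cold. So the answer is fridge.",
        " Milk is perishable and needs cold storage. So the answer is fridge.",
        " People keep drinks in a cupboard. So the answer is cupboard.",
    ]
    return lambda prompt: rng.choice(canned)


demos = [{
    "question": "Where do fish live?",
    "rationale": "Fish need water to breathe.",
    "answer": "water",
}]
chains = sample_chains(
    "Where would you store milk?", demos, make_toy_teacher(), n_samples=5
)
for chain in chains:
    print(chain)
```

Swapping `make_toy_teacher` for a wrapper around a real model would leave the rest unchanged; the key design point is that one question yields several sampled chains rather than a single one.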

The student is then trained on these sampled rationalizations, learning to generate a chain of reasoning of its own. The abstract stresses that sampling many reasoning chains per instance from the teacher is paramount, and the authors test several hypotheses about which properties of the samples matter, such as diversity, teacher likelihood, and open-endedness.
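One plausible way to turn teacher samples into student training data is sketched below. It is an illustration only: the text format, the choice to compute the loss on the rationale and answer but not on the question, and the helper names (`make_training_text`, `build_corpus`, `loss_mask`, `masked_nll`) are assumptions made for this example, not details confirmed by the source.

```python
from typing import Dict, List, Tuple


def make_training_text(question: str, rationale: str, answer: str) -> Tuple[str, str]:
    """Split one example into a prompt and the target the student must generate."""
    prompt = f"Q: {question}\nA:"
    target = f" {rationale} So the answer is {answer}."
    return prompt, target


def build_corpus(instances: List[Dict]) -> List[Tuple[str, str]]:
    """One training example per (question, sampled chain) pair."""
    corpus = []
    for inst in instances:
        for chain in inst["chains"]:
            corpus.append(
                make_training_text(inst["question"], chain["rationale"], chain["answer"])
            )
    return corpus


def loss_mask(prompt_tokens: List[str], target_tokens: List[str]) -> List[int]:
    """1 where the language-modeling loss is counted, 0 on the question tokens."""
    return [0] * len(prompt_tokens) + [1] * len(target_tokens)


def masked_nll(token_log_probs: List[float], mask: List[int]) -> float:
    """Average negative log-likelihood over the unmasked (target) positions."""
    count = sum(mask)
    if count == 0:
        raise ValueError("mask selects no tokens")
    return -sum(lp for lp, m in zip(token_log_probs, mask) if m) / count


instances = [{
    "question": "Where would you store milk?",
    "chains": [
        {"rationale": "Milk spoils when warm.", "answer": "fridge"},
        {"rationale": "Milk needs cold storage.", "answer": "fridge"},
    ],
}]
for prompt, target in build_corpus(instances):
    # Whitespace splitting stands in for a real tokenizer.
    mask = loss_mask(prompt.split(), target.split())
    print(repr(prompt), repr(target), mask)
```

Because every sampled chain becomes its own training example, a question with many samples contributes many different targets for the same prompt, which is how the "many chains per instance" idea shows up in the data.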

The researchers evaluate their approach on several commonsense benchmarks, in both supervised and few-shot settings. They report that SCoTD improves the student model's performance, especially on challenge sets, and that human raters judge the student's chains of thought as comparable to the teacher's despite the student having orders of magnitude fewer parameters.
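Scoring a model that answers through a chain of thought requires pulling the final answer out of the generated text. The sketch below assumes a multiple-choice benchmark with options A to E and generations that end in a phrase like "So the answer is (B)."; both the format and the function names are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Assumed answer format: "... the answer is (B)." with a single option letter A-E.
ANSWER_RE = re.compile(r"answer is\s*\(?([A-E])\)?(?![A-Za-z])", re.IGNORECASE)


def extract_answer(generation: str) -> Optional[str]:
    """Return the last option letter stated in the generation, or None."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1].upper() if matches else None


def accuracy(generations: List[str], gold: List[str]) -> float:
    """Fraction of generations whose extracted answer equals the gold label."""
    if not gold:
        raise ValueError("gold labels are empty")
    correct = sum(extract_answer(g) == a for g, a in zip(generations, gold))
    return correct / len(gold)


generations = [
    "Milk spoils when warm. So the answer is (B).",
    "Fish cannot breathe air. I am not sure.",
]
print(accuracy(generations, ["B", "C"]))
```

A generation with no parsable answer counts as wrong here, which is a deliberate choice: it avoids rewarding chains that never commit to an option.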

This work builds on prior research on enabling smaller models to reason more effectively, such as the "Soft Prompting" technique described in "Soft Prompting: Graph Thought, Multi-Modal Representation".

Critical Analysis

One key limitation of this work is that the performance gains of the distilled student model appear to be task-specific. The paper does not investigate whether the step-by-step reasoning capabilities learned by the student model can be seamlessly transferred to completely novel tasks.

Additionally, Symbolic Chain-of-Thought Distillation is computationally intensive: the large teacher model must generate reasoning traces for every training example, and because the method relies on sampling many chains per instance, the teacher's inference cost grows with both the dataset size and the number of samples. This could make the technique hard to scale to very large datasets or teachers.

While the authors mention the potential for this approach to democratize access to sophisticated reasoning capabilities, further research is needed to address these limitations and fully realize the benefits of this technique, as discussed in "Can Small Language Models Help Large Language Models".

Conclusion

This paper presents a technique called Symbolic Chain-of-Thought Distillation that enables smaller language models to perform step-by-step reasoning like their larger counterparts. By training on chains of thought sampled from a more capable "teacher" model, the distilled "student" model improves over the same model without distillation, and human raters judge its chains of thought as comparable to the teacher's.

This work represents an exciting development in the field of knowledge distillation, as it suggests that the benefits of advanced reasoning abilities can be enjoyed even with limited computational resources. Further research is needed to address the technique's limitations and explore its broader applicability, but the results presented in this paper are a promising step towards democratizing access to sophisticated cognitive capabilities.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Chain of Thoughtlessness: An Analysis of CoT in Planning

Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati

Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting, a method of demonstrating solution procedures, with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

6/7/2024

cs.AI

Keypoint-based Progressive Chain-of-Thought Distillation for LLMs

Kaituo Feng, Changsheng Li, Xiaolu Zhang, Jun Zhou, Ye Yuan, Guoren Wang

Chain-of-thought distillation is a powerful technique for transferring reasoning abilities from large language models (LLMs) to smaller student models. Previous methods typically require the student to mimic the step-by-step rationale produced by LLMs, often facing the following challenges: (i) Tokens within a rationale vary in significance, and treating them equally may fail to accurately mimic keypoint tokens, leading to reasoning errors. (ii) They usually distill knowledge by consistently predicting all the steps in a rationale, which falls short in distinguishing the learning order of step generation. This diverges from the human cognitive progression of starting with easy tasks and advancing to harder ones, resulting in sub-optimal outcomes. To this end, we propose a unified framework, called KPOD, to address these issues. Specifically, we propose a token weighting module utilizing mask learning to encourage accurate mimicry of keypoint tokens by the student during distillation. Besides, we develop an in-rationale progressive distillation strategy, starting with training the student to generate the final reasoning steps and gradually extending to cover the entire rationale. To accomplish this, a weighted token generation loss is proposed to assess step reasoning difficulty, and a value function is devised to schedule the progressive distillation by considering both step difficulty and question diversity. Extensive experiments on four reasoning benchmarks illustrate our KPOD outperforms previous methods by a large margin.

5/28/2024

cs.CL

Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation

Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu

Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives -- removing the answer from outputs and concatenating the question with the rationale as input -- CasCoD's two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at https://github.com/C-W-D/CasCoD.

5/31/2024

cs.CL cs.AI

Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu

As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (≈ 4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. Code can be found at https://github.com/C-W-D/EDIT.

5/31/2024

cs.CL cs.AI