Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

2405.19737

Published 5/31/2024 by Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu

Abstract

As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion ($\approx 4.7\%$) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher's reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. Code can be found at https://github.com/C-W-D/EDIT.

Overview

This paper introduces EDIT (mistakE-Driven key reasonIng step distillaTion), a reasoning distillation method that helps small student language models learn the key reasoning steps in a teacher's chain-of-thought (CoT) rather than only imitating its reasoning style.
The key idea is to prompt the teacher LLM for dual CoTs: two reasoning chains for the same question that follow similar paths but reach divergent conclusions, one correct and one incorrect. A minimum edit distance algorithm then locates the steps where the two chains differ, and training optimizes the likelihood of those steps.
This goes beyond the standard recipe of supervised fine-tuning on correct CoTs only, which the authors find leads students to imitate the teacher's reasoning forms while making errors or omissions on the key steps.
Experiments on in-domain and out-of-domain reasoning benchmarks show that EDIT is effective, and further analysis finds that it benefits more from logical errors in the dual CoTs than from knowledge or mathematical calculation errors.

Plain English Explanation

In this paper, the researchers present a new way to train small language models to become better at reasoning and problem-solving. Their approach, called EDIT, is a form of reasoning distillation: a large teacher model is prompted to write two chains of thought for the same question that look alike but end differently, one leading to the correct answer and one leading to an incorrect answer.

Comparing the two chains reveals the few steps where they part ways, which the authors call the key reasoning steps (roughly 4.7% of a chain of thought, by their estimate). The student model is then trained with extra focus on these steps. This is different from simply trying to mimic the teacher model's outputs, which mostly teaches the student the surface form of the reasoning.

By learning from these "dual chain-of-thoughts", the student model is able to go beyond just imitating the teacher and get more of the decisive steps right. According to the abstract, this improves results on both in-domain and out-of-domain reasoning benchmarks.

The researchers' key insight is that providing both the correct and incorrect reasoning chains gives the student model valuable information about what constitutes good reasoning. This helps the student model learn to reason more effectively, rather than just memorizing specific answers.

Technical Explanation

The paper presents EDIT, a reasoning distillation method that aims to improve the reasoning capabilities of small language models. The key innovation is the use of "dual chain-of-thoughts": the teacher LLM is prompted to produce two reasoning chains with similar reasoning paths but divergent conclusions, one leading to the correct answer and one leading to an incorrect answer.

A minimum edit distance algorithm is then applied to each pair of chains to locate the key reasoning steps, that is, the places where the correct and incorrect chains differ. Training optimizes the likelihood of these steps, going beyond plain supervised fine-tuning on correct CoTs, so that the student learns the steps that decide the conclusion rather than just mimicking the teacher's outputs.
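
The key-step localization can be pictured with a small sketch. This is a minimal illustration, assuming whitespace tokenization and a plain Levenshtein (minimum edit distance) alignment; the function name `locate_key_tokens` and the example chains are hypothetical and are not taken from the paper's code, which may align at a different granularity.

```python
from typing import List, Tuple


def locate_key_tokens(correct: List[str], wrong: List[str]) -> Tuple[List[int], List[int]]:
    """Align two token sequences by minimum edit distance and return the
    indices (in each sequence) of tokens that are substituted, deleted,
    or inserted. These are treated here as candidate key-step tokens."""
    n, m = len(correct), len(wrong)

    # dp[i][j] = edit distance between correct[:i] and wrong[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if correct[i - 1] == wrong[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete a token from `correct`
                dp[i][j - 1] + 1,         # insert a token from `wrong`
                dp[i - 1][j - 1] + cost,  # match or substitute
            )

    # Walk back through the table to recover which tokens were edited.
    key_correct, key_wrong = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and correct[i - 1] == wrong[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1           # identical token, not a key token
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            key_correct.append(i - 1)     # substitution
            key_wrong.append(j - 1)
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            key_correct.append(i - 1)     # token only in `correct`
            i -= 1
        else:
            key_wrong.append(j - 1)       # token only in `wrong`
            j -= 1
    return sorted(key_correct), sorted(key_wrong)


# Hypothetical dual CoTs: same reasoning path, one logical slip.
good = "Tom gives away 2 of 5 apples so 5 - 2 = 3".split()
bad = "Tom gives away 2 of 5 apples so 5 + 2 = 7".split()
key_good, key_bad = locate_key_tokens(good, bad)
print([good[i] for i in key_good])  # ['-', '3']
print([bad[i] for i in key_bad])    # ['+', '7']
```

Because the two chains share most of their tokens, the alignment isolates only the operator and the result, which matches the intuition that a small fraction of a chain decides its conclusion.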

The paper evaluates this approach on both in-domain and out-of-domain benchmark reasoning datasets, and the results validate its effectiveness; further analysis shows that it generates CoTs with more correct key reasoning steps. Closely related distillation work includes Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation, Keypoint-based Progressive Chain-of-Thought Distillation for LLMs, and Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models.

The pipeline has three parts: a teacher LLM that generates the dual CoTs, a minimum edit distance step that locates the key reasoning steps in them, and a student SLM whose training optimizes the likelihood of those steps. This relates to other work on How to Think Step-by-Step and Multimodal Chain-of-Thought Reasoning for Language Models.
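
The abstract says that EDIT optimizes the likelihood of the located key steps but this summary does not give the exact objective. One simple way to picture the idea is a token-weighted negative log-likelihood in which key-step tokens count more than the rest. The sketch below is an assumption made for illustration only: the function name `key_step_weighted_nll` and the `key_weight` parameter are hypothetical, and the paper's actual loss may differ (for example, in how it treats the incorrect chain).

```python
import math
from typing import Sequence


def key_step_weighted_nll(
    token_log_probs: Sequence[float],
    key_positions: Sequence[int],
    key_weight: float = 2.0,
) -> float:
    """Weighted average negative log-likelihood of a correct CoT.

    token_log_probs: the student's log-probability for each target token.
    key_positions:   indices of tokens flagged as key reasoning steps.
    key_weight:      hypothetical multiplier applied to key tokens;
                     all other tokens get weight 1.0.
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    keys = set(key_positions)
    weights = [key_weight if i in keys else 1.0 for i in range(len(token_log_probs))]
    total = sum(w * -lp for w, lp in zip(weights, token_log_probs))
    return total / sum(weights)


# Hypothetical student that is confident on routine tokens (p = 0.9)
# but unsure on the single key token at position 2 (p = 0.1).
log_probs = [math.log(0.9), math.log(0.9), math.log(0.1), math.log(0.9)]
plain = key_step_weighted_nll(log_probs, key_positions=[])
focused = key_step_weighted_nll(log_probs, key_positions=[2], key_weight=3.0)
print(plain < focused)  # True: the mistake on the key token now costs more
```

With uniform weights, the many easy tokens dilute the penalty for a wrong key step; raising the weight on key positions makes that error dominate the loss, which is the behaviour the method is after.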

Critical Analysis

The paper presents a promising approach to improving the reasoning capabilities of language models, and the empirical results on benchmark tasks are compelling. However, the paper does not fully address some potential limitations and avenues for future research.

One concern is the reliance on teacher-generated dual chain-of-thoughts. The abstract indicates these are produced automatically with specially designed prompts, so the method depends on the teacher reliably producing an incorrect chain that stays close to the correct one. How robust that prompting is across tasks and teacher models, and how much it adds to data-generation cost, would be worth examining.

Additionally, the paper does not delve into the interpretability of the student model's reasoning process. Understanding how the student model is able to generalize the teacher's reasoning would be valuable for building trust and transparency in these systems.

Further research could also investigate the application of Reasoning Distillation to other reasoning-intensive domains beyond language models, such as task-oriented dialogue or multi-agent cooperation. Adapting the approach to work with different types of reasoning and problem-solving tasks could expand its real-world impact.

Conclusion

EDIT, the reasoning distillation approach presented in this paper, is a promising step towards developing more capable and generalizable reasoning skills in small language models. By learning from teacher-generated dual chain-of-thoughts, the student models are able to go beyond simple imitation and focus on the key steps of the reasoning process.

The empirical results support the effectiveness of this approach on both in-domain and out-of-domain reasoning benchmarks, and the authors' analysis suggests that logical errors in the dual CoTs are more useful for training than knowledge or mathematical calculation errors. While there are some potential limitations to address, this work represents a useful advance in the field of language model reasoning and problem-solving.

As language models continue to play a larger role in our lives, improving their reasoning capabilities will be crucial for ensuring they can be trusted and relied upon to solve complex real-world problems. EDIT is a step in this direction.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Improve Student's Reasoning Generalizability through Cascading Decomposed CoTs Distillation

Chengwei Dai, Kun Li, Wei Zhou, Songlin Hu

Large language models (LLMs) exhibit enhanced reasoning at larger scales, driving efforts to distill these capabilities into smaller models via teacher-student learning. Previous works simply fine-tune student models on teachers' generated Chain-of-Thoughts (CoTs) data. Although these methods enhance in-domain (IND) reasoning performance, they struggle to generalize to out-of-domain (OOD) tasks. We believe that the widespread spurious correlations between questions and answers may lead the model to preset a specific answer which restricts the diversity and generalizability of its reasoning process. In this paper, we propose Cascading Decomposed CoTs Distillation (CasCoD) to address these issues by decomposing the traditional single-step learning process into two cascaded learning steps. Specifically, by restructuring the training objectives -- removing the answer from outputs and concatenating the question with the rationale as input -- CasCoD's two-step learning process ensures that students focus on learning rationales without interference from the preset answers, thus improving reasoning generalizability. Extensive experiments demonstrate the effectiveness of CasCoD on both IND and OOD benchmark reasoning datasets. Code can be found at https://github.com/C-W-D/CasCoD.

5/31/2024

cs.CL cs.AI

Keypoint-based Progressive Chain-of-Thought Distillation for LLMs

Kaituo Feng, Changsheng Li, Xiaolu Zhang, Jun Zhou, Ye Yuan, Guoren Wang

Chain-of-thought distillation is a powerful technique for transferring reasoning abilities from large language models (LLMs) to smaller student models. Previous methods typically require the student to mimic the step-by-step rationale produced by LLMs, often facing the following challenges: (i) Tokens within a rationale vary in significance, and treating them equally may fail to accurately mimic keypoint tokens, leading to reasoning errors. (ii) They usually distill knowledge by consistently predicting all the steps in a rationale, which falls short in distinguishing the learning order of step generation. This diverges from the human cognitive progression of starting with easy tasks and advancing to harder ones, resulting in sub-optimal outcomes. To this end, we propose a unified framework, called KPOD, to address these issues. Specifically, we propose a token weighting module utilizing mask learning to encourage accurate mimicry of keypoint tokens by the student during distillation. Besides, we develop an in-rationale progressive distillation strategy, starting with training the student to generate the final reasoning steps and gradually extending to cover the entire rationale. To accomplish this, a weighted token generation loss is proposed to assess step reasoning difficulty, and a value function is devised to schedule the progressive distillation by considering both step difficulty and question diversity. Extensive experiments on four reasoning benchmarks illustrate our KPOD outperforms previous methods by a large margin.

5/28/2024

cs.CL

Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models

Weize Liu, Guocong Li, Kai Zhang, Bang Du, Qiyuan Chen, Xuming Hu, Hongxia Xu, Jintai Chen, Jian Wu

Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments.

4/9/2024

cs.CL

On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models

Sree Harsha Tanneru, Dan Ley, Chirag Agarwal, Himabindu Lakkaraju

As Large Language Models (LLMs) are increasingly being employed in real-world applications in critical domains such as healthcare, it is important to ensure that the Chain-of-Thought (CoT) reasoning generated by these models faithfully captures their underlying behavior. While LLMs are known to generate CoT reasoning that is appealing to humans, prior studies have shown that these explanations do not accurately reflect the actual behavior of the underlying LLMs. In this work, we explore the promise of three broad approaches commonly employed to steer the behavior of LLMs to enhance the faithfulness of the CoT reasoning generated by LLMs: in-context learning, fine-tuning, and activation editing. Specifically, we introduce novel strategies for in-context learning, fine-tuning, and activation editing aimed at improving the faithfulness of the CoT reasoning. We then carry out extensive empirical analyses with multiple benchmark datasets to explore the promise of these strategies. Our analyses indicate that these strategies offer limited success in improving the faithfulness of the CoT reasoning, with only slight performance enhancements in controlled scenarios. Activation editing demonstrated minimal success, while fine-tuning and in-context learning achieved marginal improvements that failed to generalize across diverse reasoning and truthful question-answering benchmarks. In summary, our work underscores the inherent difficulty in eliciting faithful CoT reasoning from LLMs, suggesting that the current array of approaches may not be sufficient to address this complex challenge.

6/18/2024

cs.CL