Large Language Models are In-context Teachers for Knowledge Reasoning

2311.06985

Published 6/18/2024 by Jiachen Zhao, Zonghai Yao, Zhichao Yang, Hong Yu

Abstract

Chain-of-thought (CoT) prompting teaches large language models (LLMs) in context to reason over queries that require more than mere information retrieval. However, human experts are usually required to craft demonstrations for in-context learning (ICL), which is expensive and has high variance. More importantly, how to craft helpful reasoning exemplars for ICL remains unclear. In this work, we investigate whether LLMs can be better in-context teachers for knowledge reasoning. We follow the "encoding specificity" hypothesis in human's memory retrieval to assume in-context exemplars at inference should match the encoding context in training data. We are thus motivated to propose Self-Explain to use one LLM's self-elicited explanations as in-context demonstrations for prompting it as they are generalized from the model's training examples. Self-Explain is shown to significantly outperform using human-crafted exemplars and other baselines. We further reveal that for in-context teaching, rationales by distinct teacher LLMs or human experts that more resemble the student LLM's self-explanations are better demonstrations, which supports our encoding specificity hypothesis. We then propose Teach-Back that aligns the teacher LLM with the student to enhance the in-context teaching performance. For example, Teach-Back enables a 7B model to teach the much larger GPT-3.5 in context, surpassing human teachers by around 5% in test accuracy on medical question answering.

Overview

This paper explores how large language models (LLMs) can be better "in-context teachers" for knowledge reasoning tasks, rather than relying on human-crafted demonstrations.
The authors propose a method called "Self-Explain" that uses an LLM's own self-generated explanations as the in-context learning examples, based on the "encoding specificity" hypothesis.
The paper also introduces "Teach-Back", which aligns the teaching LLM with the student LLM to further enhance the in-context teaching performance.

Plain English Explanation

The paper looks at how to improve the way large language models (LLMs) are prompted to reason through complex problems. Currently, human experts have to write the worked examples (demonstrations) that are placed in the prompt for the model to learn from, which is expensive and gives inconsistent results.

The researchers' idea is to use the LLM's own self-generated explanations as the examples for in-context learning, rather than relying on human-made ones. This rests on the "encoding specificity" hypothesis: a model should learn best from prompt examples that resemble the context in which it encoded the relevant knowledge during training.

The "Self-Explain" method they developed shows this approach works significantly better than using human-crafted examples. The paper also found that teaching examples from other LLMs or human experts work best when they are similar to the student model's own self-explanations.

Building on this, the researchers created a "Teach-Back" method that aligns the teaching LLM with the student LLM, further boosting the in-context teaching performance. This even allows a 7B model to teach the much larger GPT-3.5 in context, outperforming human teachers by around 5% in test accuracy on a medical question-answering task.

Technical Explanation

The paper investigates whether LLMs can be better "in-context teachers" for knowledge reasoning tasks compared to relying on human-crafted demonstrations. The authors follow the "encoding specificity" hypothesis from human memory research, which suggests that in-context examples at inference should match the encoding context used during training.

They propose a method called "Self-Explain" that uses an LLM's own self-elicited explanations as the in-context learning demonstrations, as these are generalized from the model's training data. Experiments show Self-Explain significantly outperforms using human-crafted exemplars and other baselines.
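The Self-Explain idea described above can be sketched as a small prompt-building routine. This is only an illustration under stated assumptions, not the authors' implementation: the `generate` callable stands in for any text-completion function, the names `elicit_self_explanation` and `build_self_explain_prompt` are hypothetical, and the prompt wording is invented for the example. It assumes the model is shown each labeled question with its answer and asked to explain it, and that those explanations are then reused as demonstrations for the same model.

```python
from typing import Callable, List, Tuple


def elicit_self_explanation(
    generate: Callable[[str], str], question: str, answer: str
) -> str:
    """Ask the model to explain a known answer (prompt wording is assumed)."""
    prompt = (
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Explain step by step why this answer is correct.\n"
        "Explanation:"
    )
    return generate(prompt).strip()


def build_self_explain_prompt(
    generate: Callable[[str], str],
    labeled_examples: List[Tuple[str, str]],
    query: str,
) -> str:
    """Build a few-shot prompt whose rationales come from the model itself."""
    blocks = []
    for question, answer in labeled_examples:
        rationale = elicit_self_explanation(generate, question, answer)
        blocks.append(
            f"Question: {question}\nExplanation: {rationale}\nAnswer: {answer}"
        )
    # The final block leaves the explanation open for the model to complete.
    blocks.append(f"Question: {query}\nExplanation:")
    return "\n\n".join(blocks)
```

In use, `generate` would wrap a call to the same model that later answers the query, so that the demonstrations are in that model's own style. A human-crafted baseline would differ only in where the rationale text comes from.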

The paper further reveals that for in-context teaching, rationales from different teacher LLMs or human experts that more resemble the student LLM's self-explanations are better demonstrations, supporting the encoding specificity hypothesis.
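To make the notion of "resembling the student's self-explanations" concrete, the sketch below ranks candidate rationales by their word overlap with a student self-explanation. The Jaccard measure is a stand-in chosen for simplicity; this summary does not state which similarity measure the paper uses, and the function names are hypothetical.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of two texts' token sets (assumed similarity proxy)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def rank_by_resemblance(self_explanation: str, candidates: List[str]) -> List[str]:
    """Order candidate rationales from most to least similar to the student's."""
    return sorted(
        candidates,
        key=lambda c: jaccard(self_explanation, c),
        reverse=True,
    )
```

Under the encoding specificity hypothesis, rationales near the top of such a ranking would be expected to make better demonstrations for that student model, whether they were written by another LLM or by a human expert.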

Building on this, the authors introduce "Teach-Back", which aligns the teaching LLM with the student LLM to enhance the in-context teaching performance. This enables a 7B model to teach the much larger GPT-3.5 model, surpassing human teachers by around 5% in test accuracy on a medical question-answering task.

Critical Analysis

The paper presents a compelling approach to leveraging LLMs' own self-generated knowledge as a more effective way to provide in-context learning examples, as opposed to relying on human-crafted demonstrations.

One potential limitation is that the encoding specificity hypothesis, while supported by the results, may not fully explain all the dynamics at play. There could be other factors influencing the effectiveness of different teaching examples beyond just their similarity to the student's own reasoning.

Additionally, the Teach-Back method, while effective, relies on aligning the teaching and student models. This alignment process may be challenging to scale or generalize to arbitrary model pairings. Further research is needed to understand the broader applicability and potential issues with this approach.

Another area for exploration is how the self-explanations used as in-context examples are generated. The paper does not delve into details on this process, and there may be opportunities to further optimize or personalize the self-explanation generation to enhance the teaching effectiveness.

Overall, the paper makes a valuable contribution by demonstrating the potential of leveraging LLMs' own internal knowledge representations for more effective in-context learning. The findings and proposed methods warrant further investigation and refinement to unlock the full potential of this approach.

Conclusion

This paper presents a novel approach to improving in-context learning for large language models, moving away from reliance on human-crafted demonstrations. The key idea is to use the LLM's own self-generated explanations as the in-context learning examples, based on the encoding specificity hypothesis.

The authors show that this "Self-Explain" method significantly outperforms using human-made exemplars, and that in-context teaching is most effective when the rationales resemble the student LLM's own self-explanations. They further develop a "Teach-Back" technique to align the teaching and student models, enabling a 7B model to teach the much larger GPT-3.5 in context and surpass human teachers by around 5% in test accuracy on a medical question-answering task.

These findings have important implications for making large language models more capable of reasoning and knowledge transfer, reducing the need for costly human involvement. The proposed approaches warrant further exploration and refinement to unlock the full potential of LLMs as self-teaching, generative AI systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Supervised Knowledge Makes Large Language Models Better In-context Learners

Linyi Yang, Shuibai Zhang, Zhuohao Yu, Guangsheng Bao, Yidong Wang, Jindong Wang, Ruochen Xu, Wei Ye, Xing Xie, Weizhu Chen, Yue Zhang

Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The recent progress in large-scale generative models has further expanded their use in real-world language applications. However, the critical challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. While previous in-context learning research has focused on enhancing models to adhere to users' specific instructions and quality expectations, and to avoid undesired outputs, little to no work has explored the use of task-Specific fine-tuned Language Models (SLMs) to improve LLMs' in-context learning during the inference stage. Our primary contribution is the establishment of a simple yet effective framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks. Using our proposed plug-in method, enhanced versions of Llama 2 and ChatGPT surpass their original versions regarding generalizability and factuality. We offer a comprehensive suite of resources, including 16 curated datasets, prompts, model checkpoints, and LLM outputs across 9 distinct tasks. The code and data are released at: https://github.com/YangLinyi/Supervised-Knowledge-Makes-Large-Language-Models-Better-In-context-Learners. Our empirical analysis sheds light on the advantages of incorporating discriminative models into LLMs and highlights the potential of our methodology in fostering more reliable LLMs.

4/12/2024

cs.CL cs.AI

An Empirical Study of In-context Learning in LLMs for Machine Translation

Pranjal A. Chitale, Jay Gala, Raj Dabre

Recent interest has surged in employing Large Language Models (LLMs) for machine translation (MT) via in-context learning (ICL) (Vilar et al., 2023). Most prior studies primarily focus on optimizing translation quality, with limited attention to understanding the specific aspects of ICL that influence the said quality. To this end, we perform the first of its kind, an exhaustive study of in-context learning for machine translation. We first establish that ICL is primarily example-driven and not instruction-driven. Following this, we conduct an extensive exploration of various aspects of the examples to understand their influence on downstream performance. Our analysis includes factors such as quality and quantity of demonstrations, spatial proximity, and source versus target originality. Further, we also investigate challenging scenarios involving indirectness and misalignment of examples to understand the limits of ICL. While we establish the significance of the quality of the target distribution over the source distribution of demonstrations, we further observe that perturbations sometimes act as regularizers, resulting in performance improvements. Surprisingly, ICL does not necessitate examples from the same task, and a related task with the same target distribution proves sufficient. We hope that our study acts as a guiding resource for considerations in utilizing ICL for MT. Our code is available on https://github.com/PranjalChitale/in-context-mt-analysis.

6/6/2024

cs.CL

Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks

Yifan Wang, Qingyan Guo, Xinzhe Ni, Chufan Shi, Lemao Liu, Haiyun Jiang, Yujiu Yang

In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs), enabling them to learn input-label mappings from demonstrations and perform well on downstream tasks. However, under the standard ICL setting, LLMs may sometimes neglect query-related information in demonstrations, leading to incorrect predictions. To address this limitation, we propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering, an important form in knowledge-intensive tasks. HICL leverages LLMs' reasoning ability to extract query-related knowledge from demonstrations, then concatenates the knowledge to prompt LLMs in a more explicit way. Furthermore, we track the source of this knowledge to identify specific examples, and introduce a Hint-related Example Retriever (HER) to select informative examples for enhanced demonstrations. We evaluate HICL with HER on 3 open-domain QA benchmarks, and observe average performance gains of 2.89 EM score and 2.52 F1 score on gpt-3.5-turbo, 7.62 EM score and 7.27 F1 score on LLaMA-2-Chat-7B compared with standard setting.

4/19/2024

cs.CL

In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax

Aaron Mueller, Albert Webson, Jackson Petty, Tal Linzen

In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks: given labeled examples in the input context, the LLM learns to perform the task without weight updates. Do models guided via ICL infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that only generalize to identically distributed examples? We address this question using transformations tasks and an NLI task that assess sensitivity to syntax - a requirement for robust language understanding. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs. The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size; in particular, models pre-trained on code generalize better, and benefit more from chain-of-thought prompting.

4/11/2024

cs.CL