Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction

arXiv:2402.12170

Published 5/24/2024 by Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, Yoshitaka Ushiku

Abstract

Large language models require updates to remain up-to-date or adapt to new domains by fine-tuning them with new documents. One key is memorizing the latest information in a way that the memorized information is extractable with a query prompt. However, LLMs suffer from a phenomenon called perplexity curse; despite minimizing document perplexity during fine-tuning, LLMs struggle to extract information through a prompt sentence. In this new knowledge acquisition and extraction, we find a very intriguing fact that LLMs can accurately answer questions about the first sentence, but they struggle to extract information described in the middle or end of the documents used for fine-tuning. Our study suggests that the auto-regressive training causes this issue; each token is prompted by reliance on all previous tokens, which hinders the model from recalling information from training documents by question prompts. To conduct the in-depth study, we publish both synthetic and real datasets, enabling the evaluation of the QA performance w.r.t. the position of the corresponding answer in a document. Our investigation shows that even a large model suffers from the perplexity curse, but regularization such as denoising auto-regressive loss can enhance the information extraction from diverse positions. These findings will be (i) a key to improving knowledge extraction from LLMs and (ii) new elements to discuss the trade-off between RAG and fine-tuning in adapting LLMs to a new domain.

Overview

Large language models (LLMs) need to be updated or fine-tuned with new documents to stay current and adapt to new domains.
A key challenge is memorizing the latest information in a way that it can be easily extracted through a query prompt.
However, LLMs suffer from a phenomenon called the "perplexity curse," where minimizing document perplexity during fine-tuning does not necessarily lead to effective information extraction via prompts.
Interestingly, LLMs can accurately answer questions about the first sentence of a fine-tuning document, but struggle to extract information from the middle or end of it.
This issue is attributed to the autoregressive training of LLMs, where each token is predicted based on all previous tokens, making it difficult to recall information from specific parts of the training documents.
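
The autoregressive objective named in the last point can be written out explicitly. What follows is the standard next-token factorization, not a formula quoted from the paper; the symbols x_t (the token at position t), T (the document length), and \theta (the model parameters) are notation assumed here for illustration.

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

One reading of the positional effect under this objective: a token in a late sentence is only ever predicted with all earlier sentences as context, while a question prompt supplies none of that context. The first sentence is the exception, because it is learned with little or no preceding document text.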

Plain English Explanation

Large language models, such as GPT-3, are powerful tools for natural language processing, but they need to be regularly updated or fine-tuned with new information to remain relevant and useful. One of the key challenges is finding a way to effectively store and extract the latest knowledge from these models.

In this paper, the researchers discovered an intriguing phenomenon: while LLMs can accurately answer questions about the first sentence of a document, they struggle to extract information from the middle or end of the document, even after fine-tuning on that document. The researchers attribute this to the way LLMs are trained: each word is predicted from all the previous words, which makes it hard for the model to recall details from later parts of a document when it is given only a question.

To better understand this issue, the researchers created synthetic and real-world datasets to evaluate how the position of information within a document affects the model's ability to answer questions about it. Their findings suggest that even large language models suffer from this "perplexity curse," but that techniques like denoising autoregressive loss can help improve the model's ability to extract information from diverse positions within the text.

These insights are important for improving the way knowledge is extracted from LLMs and for understanding the tradeoffs between different approaches to adapting these models to new domains, such as fine-tuning versus other techniques like Retrieval-Augmented Generation (RAG).

Technical Explanation

The researchers in this study investigate a phenomenon known as the "perplexity curse" in large language models (LLMs). They found that despite minimizing document perplexity during fine-tuning, LLMs struggle to extract information through a query prompt, especially for information located in the middle or end of the documents used for fine-tuning.
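
To make "document perplexity" concrete, here is a minimal sketch of how perplexity is computed from per-token probabilities: it is the exponential of the average negative log-probability. The function name `perplexity` and the probability values are made up for illustration and are not taken from the paper.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) over the given tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigns to the tokens of one document.
before_finetuning = [0.05, 0.10, 0.02, 0.08]
after_finetuning = [0.90, 0.95, 0.85, 0.92]

print(perplexity(before_finetuning))  # high: the document is surprising
print(perplexity(after_finetuning))   # close to 1: the document is memorized
```

A perplexity close to 1 means the model reproduces the document almost perfectly when given its own preceding text. The perplexity curse is the observation that this alone does not guarantee the same facts can be retrieved from a question prompt.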

To study this in depth, the researchers released both synthetic and real-world datasets that make it possible to measure question-answering (QA) performance as a function of where the answer appears in the fine-tuning document. Their investigation shows that even a large model suffers from the perplexity curse.
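
The position-wise evaluation described above can be sketched as a simple bucketing of QA results by the index of the sentence that contains the answer. The record layout, the exact-match scoring rule, and the function name `accuracy_by_position` are assumptions for illustration, and the records are toy values rather than results from the paper.

```python
from collections import defaultdict


def accuracy_by_position(records):
    """Compute exact-match accuracy per answer position.

    records: iterable of (sentence_index, predicted_answer, gold_answer),
    where sentence_index is the position of the answer-bearing sentence
    in the fine-tuning document (0 = first sentence).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for position, predicted, gold in records:
        total[position] += 1
        # Case-insensitive exact match after trimming whitespace.
        if predicted.strip().lower() == gold.strip().lower():
            correct[position] += 1
    return {pos: correct[pos] / total[pos] for pos in sorted(total)}


# Toy records, invented to show the output shape only.
records = [
    (0, "Paris", "Paris"),
    (0, "1987", "1987"),
    (1, "blue", "green"),
    (1, "Tokyo", "tokyo"),
    (2, "unknown", "Oslo"),
    (2, "seven", "nine"),
]

print(accuracy_by_position(records))  # {0: 1.0, 1: 0.5, 2: 0.0}
```

Reporting accuracy per position, rather than as a single average, is what makes a positional bias visible at all.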

The researchers attribute the issue to autoregressive training: because each token is predicted from all the tokens before it, information stated later in a document is learned only in the context of the preceding text, and a standalone question prompt does not reproduce that context. They also found that regularization, such as a denoising autoregressive loss, improves the extraction of information from diverse positions in the document.
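
The summary names a denoising autoregressive loss but does not specify it. The sketch below shows one plausible form of the data preparation, assuming that a fraction of the input tokens is replaced with random vocabulary tokens while the next-token targets stay clean. The function name `make_denoising_example`, the replacement scheme, and the `noise_prob` parameter are assumptions, not the authors' confirmed method.

```python
import random


def make_denoising_example(token_ids, vocab_size, noise_prob=0.1, rng=None):
    """Build (noisy_inputs, clean_targets) for next-token prediction.

    Assumed scheme: each input token is independently replaced by a
    uniformly random token id with probability noise_prob. Targets are
    the original sequence shifted by one position and are never corrupted.
    """
    rng = rng or random.Random()
    inputs = list(token_ids[:-1])
    targets = list(token_ids[1:])
    noisy_inputs = [
        rng.randrange(vocab_size) if rng.random() < noise_prob else tok
        for tok in inputs
    ]
    return noisy_inputs, targets


# Hypothetical token ids for a short document.
doc = [11, 42, 7, 99, 3, 58]
noisy, targets = make_denoising_example(
    doc, vocab_size=100, noise_prob=0.3, rng=random.Random(0)
)
print(noisy)
print(targets)
```

The intuition behind this kind of regularizer is that the model can no longer rely on an exact copy of the preceding text to predict the next token, which may make the stored information less dependent on its original context.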

These findings have important implications for improving knowledge extraction from LLMs and for understanding the tradeoffs between different approaches to adapting these models to new domains, such as fine-tuning versus Retrieval-Augmented Generation (RAG).

Critical Analysis

The researchers provide a thoughtful analysis of the challenges faced by large language models in effectively extracting information from the training documents, particularly when the relevant information is located in the middle or end of the text. Their findings highlight the limitations of the autoregressive training approach and suggest that more research is needed to address the "perplexity curse" that plagues these models.

One potential area for further investigation could be exploring alternative training approaches or architectural designs that can better capture and recall information from diverse positions within the training data. Additionally, the researchers could delve deeper into the implications of their findings for real-world applications, such as question-answering systems or knowledge-base construction, and explore ways to mitigate the identified issues.

The study is a valuable contribution to our understanding of LLM behavior, but its conclusions should continue to be scrutinized and tested so that future language models can reliably extract and use knowledge from diverse sources.

Conclusion

This study sheds light on a significant challenge facing large language models: the "perplexity curse" that hinders their ability to effectively extract information from the middle or end of documents used for fine-tuning, despite their strong performance on the first sentence.

The researchers' findings have important implications for improving knowledge extraction from LLMs and understanding the tradeoffs between different approaches to adapting these models to new domains. Their insights suggest that addressing the autoregressive training issue and exploring regularization techniques, such as denoising autoregressive loss, could be promising avenues for enhancing the information extraction capabilities of large language models.

As the field of natural language processing continues to evolve, it is crucial to continue exploring these challenges and pushing the boundaries of what is possible with large language models. The researchers' work contributes valuable knowledge to this ongoing effort and lays the groundwork for future advancements in the field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Yijiong Yu, Huiqiang Jiang, Xufang Luo, Qianhui Wu, Chin-Yew Lin, Dongsheng Li, Yuqing Yang, Yongfeng Huang, Lili Qiu

Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as lost in the middle, a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling this positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context window extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at https://aka.ms/PositionalHidden.

6/5/2024

cs.CL cs.LG

Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell

Taiming Lu, Muhan Gao, Kuai Yu, Adam Byerly, Daniel Khashabi

Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a know but don't tell phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.

6/24/2024

cs.CL

Position-Aware Parameter Efficient Fine-Tuning Approach for Reducing Positional Bias in LLMs

Zheng Zhang, Fan Yang, Ziyan Jiang, Zheng Chen, Zhengyang Zhao, Chengyuan Ma, Liang Zhao, Yang Liu

Recent advances in large language models (LLMs) have enhanced their ability to process long input contexts. This development is particularly crucial for tasks that involve retrieving knowledge from an external datastore, which can result in long inputs. However, recent studies show a positional bias in LLMs, demonstrating varying performance depending on the location of useful information within the input sequence. In this study, we conduct extensive experiments to investigate the root causes of positional bias. Our findings indicate that the primary contributor to LLM positional bias stems from the inherent positional preferences of different models. We demonstrate that merely employing prompt-based solutions is inadequate for overcoming the positional preferences. To address this positional bias issue of a pre-trained LLM, we developed a Position-Aware Parameter Efficient Fine-Tuning (PAPEFT) approach which is composed of a data augmentation technique and a parameter efficient adapter, enhancing a uniform attention distribution across the input context. Our experiments demonstrate that the proposed approach effectively reduces positional bias, improving LLMs' effectiveness in handling long context sequences for various tasks that require externally retrieved knowledge.

4/3/2024

cs.CL cs.AI cs.LG

Found in the Middle: Calibrating Positional Attention Bias Improves Long Context Utilization

Cheng-Yu Hsieh, Yung-Sung Chuang, Chun-Liang Li, Zifeng Wang, Long T. Le, Abhishek Kumar, James Glass, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister

Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle to LLMs' intrinsic attention bias: LLMs exhibit a U-shaped attention bias where the tokens at the beginning and at the end of its input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through a calibration mechanism, found-in-the-middle, that allows the model to attend to contexts faithfully according to their relevance, even though when they are in the middle. Third, we show found-in-the-middle not only achieves better performance in locating relevant information within a long context, but also eventually leads to improved retrieval-augmented generation (RAG) performance across various tasks, outperforming existing methods by up to 15 percentage points. These findings open up future directions in understanding LLM attention bias and its potential consequences.

6/26/2024

cs.CL cs.AI cs.LG