Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!

2406.11629

Published 7/2/2024 by Mingyang Song, Mao Zheng, Xuan Luo

Abstract

Leveraging Large Language Models (LLMs) as judges for judging the performance of LLMs has recently garnered attention. However, this type of approach is affected by the potential biases in LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, which rely on two existing settings of many-shot ICL for helping GPT-4o-as-a-Judge in single answer grading to mitigate the potential biases in LLMs, Reinforced ICL and Unsupervised ICL. Concretely, the former utilizes in-context examples with model-generated rationales, and the latter without. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the consistency and quality of the judgment results. Furthermore, we reveal the symbol bias hidden in the pairwise comparison of GPT-4o-as-a-Judge and propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.

Overview

This paper asks whether many-shot in-context learning (ICL) can make a long-context LLM a more reliable judge of other LLMs' outputs.
The researchers design two many-shot prompt versions for GPT-4o-as-a-Judge in single answer grading: Reinforced ICL, whose in-context examples include model-generated rationales, and Unsupervised ICL, whose examples do not.
Experiments show that GPT-4o judges better in the many-shot regime than in the zero-shot regime. The paper also reveals a symbol bias in GPT-4o's pairwise comparisons and proposes a simple approach to mitigate it.

Plain English Explanation

LLMs are increasingly used as judges that grade the outputs of other LLMs. The concern is that the judge has biases of its own, which makes the evaluation results less reliable. The paper investigates whether "many-shot in-context learning", which places a large number of examples in the prompt, can reduce this problem.

The approach changes only the prompt given to GPT-4o, an existing long-context LLM. The researchers test two kinds of many-shot prompt. In Reinforced ICL, each in-context example comes with a rationale generated by a model. In Unsupervised ICL, the examples are shown without rationales. The researchers then vary how many examples the prompt contains and observe how the consistency and quality of the judgments change.

The experiments showed that GPT-4o judges better when it sees many examples than when it sees none. The paper also found that, when GPT-4o compares two answers side by side, its verdicts are affected by a "symbol bias", and it proposes a simple fix that the experiments show to be effective.

Technical Explanation

The paper studies GPT-4o-as-a-Judge in single answer grading and asks whether many-shot in-context prompts can mitigate the biases that make LLM-based evaluation unreliable.

The researchers propose two versions of many-shot prompts, each built on an existing many-shot ICL setting. Reinforced ICL uses in-context examples that carry model-generated rationales; Unsupervised ICL uses in-context examples without rationales. With these prompts, they scale the number of in-context examples and measure the effect on the consistency and quality of the judgment results.
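To make the two prompt settings named in the abstract concrete, here is a minimal sketch of how such prompts might be assembled. Everything in it is an assumption for illustration: the `Demo` fields, the `[Question]`/`[Answer]`/`[Judgment]` layout, the instruction text, and the 1-to-10 rating scale are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical instruction text; the paper's actual wording is not given here.
INSTRUCTION = "You are a judge. Rate the answer to the question from 1 to 10."


@dataclass
class Demo:
    """One in-context example (hypothetical layout)."""
    question: str
    answer: str
    rationale: Optional[str] = None  # model-generated rationale, used in Reinforced ICL
    rating: Optional[int] = None     # rating that accompanies the rationale


def format_demo(demo: Demo, with_rationale: bool) -> str:
    """Render one example, with or without its rationale and rating."""
    parts = [f"[Question]\n{demo.question}", f"[Answer]\n{demo.answer}"]
    if with_rationale:
        if demo.rationale is None or demo.rating is None:
            raise ValueError("Reinforced ICL examples need a rationale and a rating")
        parts.append(f"[Judgment]\n{demo.rationale}\nRating: {demo.rating}")
    return "\n".join(parts)


def build_prompt(demos: List[Demo], question: str, answer: str,
                 mode: str, num_shots: int) -> str:
    """Build a judge prompt with `num_shots` examples (0 gives zero-shot)."""
    if mode not in ("reinforced", "unsupervised"):
        raise ValueError(f"unknown mode: {mode}")
    blocks = [INSTRUCTION]
    for demo in demos[:num_shots]:
        blocks.append(format_demo(demo, with_rationale=(mode == "reinforced")))
    # The item to be graded goes last, ending where the judgment should start.
    blocks.append(f"[Question]\n{question}\n[Answer]\n{answer}\n[Judgment]")
    return "\n\n".join(blocks)
```

Scaling the number of in-context examples then amounts to calling `build_prompt` with increasing `num_shots` and sending each prompt to the judge model; that call is left out here because it depends on the API client in use.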

The results showed that advanced long-context LLMs such as GPT-4o perform better as judges in the many-shot regime than in the zero-shot regime. Separately, the authors reveal a symbol bias hidden in GPT-4o's pairwise comparisons and propose a simple approach to mitigate it; the experiments verify that this approach is effective.
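The abstract does not spell out how the symbol bias is defined or mitigated, so the following is only a diagnostic sketch under a stated assumption: that symbol bias means a pairwise verdict depends on the label symbols attached to the two answers rather than on their content. The `judge` callable, the prompt wording, and the `symbol_flip_rate` name are all hypothetical, and this is not the authors' mitigation method.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: takes a prompt, returns the chosen symbol.
Judge = Callable[[str], str]


def pairwise_prompt(question: str, first: str, second: str,
                    symbols: Tuple[str, str]) -> str:
    """Label two answers with the given symbols and ask for a preference."""
    s1, s2 = symbols
    return (
        f"Question: {question}\n"
        f"Answer {s1}: {first}\n"
        f"Answer {s2}: {second}\n"
        f"Which answer is better? Reply with {s1} or {s2}."
    )


def symbol_flip_rate(judge: Judge, items: List[Tuple[str, str, str]],
                     symbols: Tuple[str, str] = ("A", "B")) -> float:
    """Fraction of items whose preferred answer changes when labels are swapped.

    The answer order stays fixed; only the symbols are exchanged, so any
    change in the preferred content is attributable to the symbols.
    """
    if not items:
        raise ValueError("items must not be empty")
    s1, s2 = symbols
    flips = 0
    for question, first, second in items:
        original = judge(pairwise_prompt(question, first, second, (s1, s2)))
        swapped = judge(pairwise_prompt(question, first, second, (s2, s1)))
        picked_first_original = original.strip() == s1
        picked_first_swapped = swapped.strip() == s2  # first answer now carries s2
        if picked_first_original != picked_first_swapped:
            flips += 1
    return flips / len(items)
```

A judge that always answers with the same symbol would score 1.0 on this measure, while a judge that decides purely on content would score 0.0. Real judges would also need repeated sampling to separate symbol effects from ordinary run-to-run variation.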

Critical Analysis

The paper offers evidence that many-shot in-context learning can improve GPT-4o as a judge relative to zero-shot prompting. The abstract, however, frames the result around advanced long-context LLMs such as GPT-4o, so how far the gains carry over to other judge models is a question for the full paper and for follow-up work.

Reinforced ICL relies on model-generated rationales, which may themselves carry the biases the method is meant to reduce. Addressing potential biases or inconsistencies that may arise from the in-context learning process would be an important area for future work, as would a closer look at why the symbol bias in pairwise comparison appears in the first place.

Overall, this work is a useful step toward more reliable LLM-based evaluation, with implications for any setting that uses an LLM to grade or compare model outputs.

Conclusion

This paper studies two versions of many-shot in-context prompts, Reinforced ICL and Unsupervised ICL, for GPT-4o-as-a-Judge in single answer grading. The findings indicate that showing the judge many in-context examples yields better judgments than zero-shot prompting.

The paper also reveals a symbol bias in GPT-4o's pairwise comparisons and proposes a simple mitigation that the experiments support. While biases in LLM judges remain a concern, this research suggests that the long context windows of recent models can be used to make LLM-based evaluation more consistent and of higher quality.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Many-Shot In-Context Learning

Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, Hugo Larochelle

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.

5/24/2024

cs.LG cs.AI cs.CL

Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu

Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; however, naively adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+23% on average across models) on various QA datasets with long context, especially when the answer lies within the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.

6/26/2024

cs.CL

Many-Shot In-Context Learning in Multimodal Foundation Models

Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, Andrew Y. Ng

Large language models are well-known to be effective at few-shot in-context learning (ICL). Recent advancements in multimodal foundation models have enabled unprecedentedly long context windows, presenting an opportunity to explore their capability to perform ICL with many more demonstrating examples. In this work, we evaluate the performance of multimodal foundation models scaling from few-shot to many-shot ICL. We benchmark GPT-4o and Gemini 1.5 Pro across 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We observe that many-shot ICL, including up to almost 2,000 multimodal demonstrating examples, leads to substantial improvements compared to few-shot (<100 examples) ICL across all of the datasets. Further, Gemini 1.5 Pro performance continues to improve log-linearly up to the maximum number of tested examples on many datasets. Given the high inference costs associated with the long prompts required for many-shot ICL, we also explore the impact of batching multiple queries in a single API call. We show that batching up to 50 queries can lead to performance improvements under zero-shot and many-shot ICL, with substantial gains in the zero-shot setting on multiple datasets, while drastically reducing per-query cost and latency. Finally, we measure ICL data efficiency of the models, or the rate at which the models learn from more demonstrating examples. We find that while GPT-4o and Gemini 1.5 Pro achieve similar zero-shot performance across the datasets, Gemini 1.5 Pro exhibits higher ICL data efficiency than GPT-4o on most datasets. Our results suggest that many-shot ICL could enable users to efficiently adapt multimodal foundation models to new applications and domains. Our codebase is publicly available at https://github.com/stanfordmlgroup/ManyICL .

5/17/2024

cs.LG cs.AI cs.CL cs.CV

Long-context LLMs Struggle with Long In-context Learning

Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

Large Language Models (LLMs) have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label spaces to make correct predictions. We evaluate on 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label space and shorter demonstrations. However, they struggle with more challenging tasks like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long context understanding and reasoning is still a challenging task for the existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for the future long-context LLMs.

6/13/2024

cs.CL cs.AI