Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

Read original: arXiv:2310.07301 - Published 5/24/2024 by Yuchong Sun, Che Liu, Kun Zhou, Jinwen Huang, Ruihua Song, Wayne Xin Zhao, Fuzheng Zhang, Di Zhang, Kun Gai

💬

Overview

This paper introduces Parrot, a solution aimed at enhancing the multi-turn instruction following ability of large language models (LLMs).
The researchers propose an efficient method for collecting multi-turn instructions with human-like queries, and a context-aware preference optimization strategy to improve LLMs' performance on complex queries in multi-turn interactions.
They also introduce a new multi-turn benchmark to quantitatively evaluate LLMs' instruction following capabilities.

Plain English Explanation

Large language models (LLMs) are powerful AI systems that can understand and generate human-like text. However, most studies have overlooked the ability of LLMs to follow multi-turn instructions, where users provide a series of queries or commands over multiple back-and-forth interactions.

The researchers behind this paper wanted to address this gap. They developed a solution called Parrot that aims to enhance LLMs' multi-turn instruction following abilities.

First, the researchers created an efficient way to collect a dataset of multi-turn instructions that mimic how humans would naturally interact, including things like referring back to previous statements (anaphora) and leaving out certain details (ellipsis). Having a high-quality dataset is crucial for training LLMs to handle these types of complex, conversational queries.

Second, the researchers developed a new training strategy called "context-aware preference optimization" to further improve LLMs' performance on multi-turn instructions. This method helps the models better understand and respond to the full context of an ongoing conversation.

Finally, the researchers built a new benchmark to evaluate how well LLMs can follow multi-turn instructions. This will allow researchers to objectively measure progress in this important area of language AI.

Through extensive testing, the researchers found that Parrot can improve current LLMs by up to 7.2% in their multi-turn instruction following abilities. This represents a significant advancement that could lead to more natural and effective interactions between humans and language AI systems.

Technical Explanation

The paper first highlights the lack of research on the multi-turn instruction following capabilities of large language models (LLMs), despite the importance of this skill for real-world applications. The authors then introduce their Parrot solution, which consists of three key components:

Multi-Turn Instruction Dataset: The researchers developed an efficient method to collect a dataset of multi-turn instructions that feature human-like queries, such as anaphora (references to previous statements) and ellipsis (omission of certain details). This high-quality dataset is crucial for training LLMs to handle complex, conversational interactions.
Context-Aware Preference Optimization: The paper proposes a new training strategy called "context-aware preference optimization" to enhance LLMs' performance on multi-turn instructions. This method helps the models better understand and respond to the full context of an ongoing conversation, rather than just individual statements.
Multi-Turn Benchmark: To quantitatively evaluate LLMs' instruction following abilities, the researchers manually built a new multi-turn benchmark derived from existing datasets. This benchmark will allow for more rigorous assessment of progress in this area.

Through extensive experiments, the authors demonstrate that their Parrot solution can improve current LLMs by up to 7.2% in multi-turn instruction following tasks. This significant improvement highlights the importance of addressing this overlooked aspect of language AI and the potential impact of the Parrot approach.

Critical Analysis

The paper presents a well-designed and comprehensive solution to enhance LLMs' multi-turn instruction following capabilities. The researchers' efforts to create a high-quality dataset of multi-turn instructions and develop a context-aware training strategy are particularly noteworthy.

One potential limitation of the research is the size and diversity of the multi-turn benchmark dataset. While the authors state that it was manually built, it's unclear how representative it is of real-world multi-turn interactions. Expanding the benchmark to include a wider range of scenarios and use cases could provide more robust and generalizable insights.

Additionally, the paper does not delve into the specific architectural or algorithmic changes made to the LLMs as part of the Parrot solution. Further details on the model modifications and their impact on performance would be valuable for researchers looking to build upon this work.

Overall, the Parrot approach represents a significant advancement in the field of language AI, and the researchers' emphasis on multi-turn instruction following is a timely and important contribution. As conversational interfaces become more prevalent, the ability of LLMs to understand and respond to complex, context-rich interactions will be increasingly crucial.

Conclusion

This paper introduces Parrot, a solution that enhances the multi-turn instruction following capabilities of large language models (LLMs). The researchers developed an efficient method for collecting a dataset of multi-turn instructions with human-like queries, and a context-aware preference optimization strategy to improve LLMs' performance on complex, conversational interactions.

By creating a new multi-turn benchmark and conducting extensive experiments, the authors demonstrate that Parrot can improve current LLMs by up to 7.2% in multi-turn instruction following tasks. This significant advancement could lead to more natural and effective interactions between humans and language AI systems, with potential applications in areas like conversational assistants, question-answering systems, and task-oriented dialogues.

The Parrot approach represents an important step forward in language AI research, highlighting the need to address the multi-turn instruction following abilities of LLMs. As conversational interfaces continue to evolve, the insights and methods presented in this paper will likely inform future developments in this rapidly advancing field.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models

Yuchong Sun, Che Liu, Kun Zhou, Jinwen Huang, Ruihua Song, Wayne Xin Zhao, Fuzheng Zhang, Di Zhang, Kun Gai

Humans often interact with large language models (LLMs) in multi-turn interaction to obtain desired answers or more information. However, most existing studies overlook the multi-turn instruction following ability of LLMs, in terms of training dataset, training method, and evaluation benchmark. In this paper, we introduce Parrot, a solution aiming to enhance multi-turn instruction following for LLMs. First, we introduce an efficient but effective method for collecting multi-turn instructions that feature human-like queries, such as anaphora and ellipsis. Second, we propose a context-aware preference optimization strategy to further enhance LLMs for complex queries in multi-turn interaction. Moreover, to quantitatively evaluate LLMs in multi-turn instruction following, we manually build a multi-turn benchmark derived from existing ones. Extensive experiments show that Parrot improves current LLMs by up to 7.2% in multi-turn instruction following. Our dataset and codes will be open-sourced to facilitate future research.

5/24/2024

Parrot: Multilingual Visual Instruction Tuning

Hai-Long Sun, Da-Wei Zhou, Yang Li, Shiyin Lu, Chao Yi, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available. Code is available at: https://github.com/AIDC-AI/Parrot.

8/13/2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable

Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu

The rise of large language models (LLMs) has enabled LLM-based applications (a.k.a. AI agents or co-pilots), a new software paradigm that combines the strength of LLM and conventional software. Diverse LLM applications from different tenants could design complex workflows using multiple LLM requests to accomplish one task. However, they have to use the over-simplified request-level API provided by today's public LLM services, losing essential application-level information. Public LLM services have to blindly optimize individual LLM requests, leading to sub-optimal end-to-end performance of LLM applications. This paper introduces Parrot, an LLM service system that focuses on the end-to-end experience of LLM-based applications. Parrot proposes Semantic Variable, a unified abstraction to expose application-level knowledge to public LLM services. A Semantic Variable annotates an input/output variable in the prompt of a request, and creates the data pipeline when connecting multiple LLM requests, providing a natural way to program LLM applications. Exposing Semantic Variables to the public LLM service allows it to perform conventional data flow analysis to uncover the correlation across multiple LLM requests. This correlation opens a brand-new optimization space for the end-to-end performance of LLM-based applications. Extensive evaluations demonstrate that Parrot can achieve up to an order-of-magnitude improvement for popular and practical use cases of LLM applications.

5/31/2024

📈

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild

Huayang Li, Siheng Li, Deng Cai, Longyue Wang, Lemao Liu, Taro Watanabe, Yujiu Yang, Shuming Shi

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering larger language models with the multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. We release our dataset, model, and demo to foster future research in the area of multimodal instruction following.

6/4/2024