FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

2310.20410

Published 6/6/2024 by Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang

cs.CL


Abstract

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response follows constraints stated in the instruction. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. To assess whether LLMs' outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions. By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.


Overview

The paper proposes a new benchmark called FollowBench to evaluate how well large language models (LLMs) can follow instructions with fine-grained constraints.
Existing benchmarks focus on evaluating the quality of responses, but not whether the responses follow the instructions given.
FollowBench includes five different types of constraints (content, situation, style, format, and example) and uses a multi-level mechanism to gradually increase the difficulty.
The authors evaluate 13 popular LLMs on FollowBench and find that they struggle with instruction following, highlighting areas for future research.

Plain English Explanation

When we ask a large language model to perform a task, it's important that the model's response follows the specific instructions we provide. However, existing benchmarks tend to focus more on the quality of the response itself, rather than how well the model adheres to the given instructions.

To address this gap, the researchers created a new benchmark called FollowBench. This benchmark includes different types of constraints that the model must follow, such as content requirements, situational context, stylistic guidelines, formatting rules, and example-based instructions. The benchmark uses a multi-level approach, where the difficulty increases as more constraints are added to the initial instruction.

By evaluating 13 popular LLMs on FollowBench, the researchers found that these models often fail to satisfy the constraints stated in their instructions. This highlights the need for further research and development to improve LLMs' ability to handle complex, constrained instructions, which is crucial for real-world applications.

Technical Explanation

The paper introduces FollowBench, a new benchmark designed to assess how well large language models (LLMs) can follow instructions with fine-grained constraints. Unlike existing benchmarks that focus on response quality, FollowBench specifically evaluates whether the model's output satisfies the constraints stated in the instruction.

FollowBench encompasses five different types of constraints: content, situation, style, format, and example. The researchers use a multi-level mechanism to gradually increase the difficulty, where each level adds a single constraint to the initial instruction. This allows for a more precise assessment of the model's ability to follow instructions of varying complexity.
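The multi-level mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Constraint` class, the `build_levels` function, and the way a constraint is appended to the instruction text are all assumptions made for this example. Only the five constraint types and the "one added constraint per level" rule come from the paper.

```python
from dataclasses import dataclass

# The five constraint types named in the paper.
CONSTRAINT_TYPES = {"content", "situation", "style", "format", "example"}


@dataclass(frozen=True)
class Constraint:
    """Hypothetical container for one fine-grained constraint."""
    kind: str  # one of CONSTRAINT_TYPES
    text: str  # the sentence that states the constraint


def build_levels(initial_instruction: str, constraints: list[Constraint]) -> list[str]:
    """Return the instruction at every level.

    Level 0 is the initial instruction; level k adds exactly one more
    constraint to level k-1. Appending the constraint text is an assumption
    of this sketch; the benchmark may word the combined instruction differently.
    """
    levels = [initial_instruction]
    for constraint in constraints:
        if constraint.kind not in CONSTRAINT_TYPES:
            raise ValueError(f"unknown constraint type: {constraint.kind!r}")
        levels.append(f"{levels[-1]} {constraint.text}")
    return levels


if __name__ == "__main__":
    demo = build_levels(
        "Write a short poem about autumn.",
        [
            Constraint("format", "Use exactly four lines."),
            Constraint("style", "Write in a melancholic tone."),
        ],
    )
    for level, instruction in enumerate(demo):
        print(f"Level {level}: {instruction}")
```

Because each level differs from the previous one by a single constraint, a drop in performance between two adjacent levels can be attributed to that one added constraint.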

To judge the outputs, the researchers prompt a strong LLM, acting as the evaluator, with the constraint-evolution path: the sequence of instructions from the initial one up to the current level, each adding one new constraint. Showing the evaluator how the instruction evolved helps it determine whether the output satisfies every individual constraint, even for challenging open-ended instructions.
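One way to picture this evaluation step is the sketch below. It only builds the text sent to an evaluator model and parses a reply; it makes no API call. The prompt wording, the YES/NO reply format, and the function names `build_judge_prompt` and `parse_verdicts` are assumptions for illustration, not the prompt used in the paper.

```python
def build_judge_prompt(path: list[str], response: str) -> str:
    """Format a constraint-evolution path and a model response for an evaluator LLM.

    `path[0]` is the initial instruction and each later entry adds one
    constraint. The prompt wording is hypothetical.
    """
    if len(path) < 2:
        raise ValueError("a path needs the initial instruction plus at least one level")
    n_constraints = len(path) - 1
    steps = "\n".join(f"Level {i}: {instruction}" for i, instruction in enumerate(path))
    return (
        "The instruction below was built up one constraint at a time.\n"
        f"{steps}\n\n"
        f"Response to the Level {n_constraints} instruction:\n{response}\n\n"
        f"For each of the {n_constraints} added constraints, answer YES or NO "
        "on its own line, in the order the constraints were added."
    )


def parse_verdicts(judge_output: str, n_constraints: int) -> list[bool]:
    """Turn the evaluator's YES/NO lines into one boolean per constraint."""
    verdicts = [line.strip().upper() for line in judge_output.splitlines() if line.strip()]
    if len(verdicts) != n_constraints or any(v not in {"YES", "NO"} for v in verdicts):
        raise ValueError("evaluator output does not match the expected YES/NO format")
    return [v == "YES" for v in verdicts]


if __name__ == "__main__":
    path = [
        "Write a short poem about autumn.",
        "Write a short poem about autumn. Use exactly four lines.",
        "Write a short poem about autumn. Use exactly four lines. Write in a melancholic tone.",
    ]
    print(build_judge_prompt(path, "Leaves fall slowly..."))
    print(parse_verdicts("YES\nNO", n_constraints=2))
```

Returning one verdict per constraint, rather than a single pass/fail, mirrors the paper's stated goal of checking every individual constraint.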

The researchers evaluated 13 closed-source and open-source LLMs on FollowBench and found clear weaknesses in instruction following. This points to the need for further research and development on LLMs' instruction-following capabilities, which real-world applications depend on.

Critical Analysis

The FollowBench benchmark proposed in this paper is a valuable contribution to the field of language model evaluation, as it addresses an important gap in existing benchmarks. By focusing on the ability to follow instructions with fine-grained constraints, the researchers have highlighted a critical weakness in current LLMs that needs to be addressed.

One potential limitation of FollowBench is the scope of the constraints included. While the five types of constraints (content, situation, style, format, and example) cover a broad range, there may be other types of constraints that are relevant for real-world applications. Additionally, the multi-level approach adds one constraint at a time, so it may not fully capture the complexity of real-world instructions, which can involve multiple, intertwined constraints.

Another area for further research could be the exploration of techniques to improve LLMs' instruction-following abilities. The paper does not delve into specific approaches that could be used to address the weaknesses identified, such as fine-tuning on instruction-following datasets or incorporating explicit constraint-following mechanisms into the model architecture.

Overall, the FollowBench benchmark and the insights gained from evaluating popular LLMs are valuable contributions to the ongoing efforts to develop more capable and reliable language models. The research highlights the importance of not just focusing on response quality, but also on the ability to follow instructions, which is crucial for real-world applications.

Conclusion

The paper introduces FollowBench, a novel benchmark for evaluating how well large language models (LLMs) can follow instructions with fine-grained constraints. By including five different types of constraints and using a multi-level mechanism to gradually increase the difficulty, FollowBench provides a comprehensive way to assess LLMs' instruction-following capabilities.

The evaluation of 13 popular LLMs on FollowBench reveals weaknesses in how these models follow instructions. This finding underscores the need for further research and development to improve LLMs' ability to handle complex, constrained instructions, which is essential for real-world applications.

By providing a robust benchmark and highlighting the weaknesses of current LLMs in instruction following, this research paves the way for advancements in the field of language model development and their real-world deployment.

