RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Read original: arXiv:2402.13463 - Published 7/25/2024 by Jianhao Yan, Yun Luo, Yue Zhang

RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Overview

This paper introduces RefuteBench, a new benchmark for evaluating large language models' ability to follow instructions while also refuting inappropriate or unethical requests.
RefuteBench assesses how well models can understand and follow instructions, while also recognizing and refusing requests that are harmful, illegal, or go against the model's principles.
The paper describes the benchmark's design, the datasets used, and the evaluation metrics. It also presents results from testing several prominent language models on RefuteBench.

Plain English Explanation

The paper presents a new way to test how well large language models, like GPT-3 or DALL-E, can follow instructions while also refusing requests that are unethical or harmful. This is important because as these models become more capable, they need to be able to understand instructions and complete tasks, but also recognize when a request is inappropriate and refuse to carry it out.

The researchers created RefuteBench, a benchmark that evaluates a model's ability to both follow instructions and refuse unethical requests. The benchmark includes a variety of tasks and scenarios that test these capabilities.

By testing prominent language models on RefuteBench, the researchers were able to get a sense of how well these models can balance following instructions with recognizing and rejecting harmful requests. This is an important step in ensuring these powerful AI systems are developed and used responsibly.

Technical Explanation

The paper introduces RefuteBench, a new benchmark for evaluating instruction-following and refusal capabilities in large language models. RefuteBench builds on previous work on instruction-following and refusal benchmarks.

The benchmark includes a diverse set of tasks that test a model's ability to understand and follow instructions, as well as its capacity to recognize and refuse unethical, harmful, or illegal requests. The dataset was constructed by having human annotators generate instructions and refutable requests across a variety of domains.

The paper presents results from testing several prominent language models, including GPT-3, on RefuteBench. The evaluation metrics capture both the models' instruction-following accuracy as well as their ability to correctly refuse inappropriate requests. The results provide insights into the current state of these capabilities in large language models and identify areas for improvement.

Critical Analysis

The RefuteBench framework represents an important step forward in assessing the safety and reliability of large language models. By evaluating both instruction-following and refusal capabilities, the benchmark highlights a key challenge in developing AI systems that can be beneficial and trustworthy.

One limitation noted in the paper is the potential for bias in the dataset, as the instructions and refutable requests were generated by human annotators. There may be systematic biases or blind spots in the types of scenarios covered. Expanding the dataset and testing the benchmark's robustness to distribution shifts would be valuable areas for future research.

Additionally, the paper does not delve deeply into the specific techniques or architectural choices that enable strong refusal capabilities in language models. Further work is needed to understand the underlying mechanisms and to develop more principled approaches to imbuing models with robust ethical reasoning.

Overall, RefuteBench represents an important contribution to the ongoing efforts to ensure large language models are developed and deployed responsibly. The insights gained from this benchmark can help guide the field towards more capable and trustworthy AI systems.

Conclusion

The RefuteBench paper presents a new benchmark for evaluating large language models' ability to both follow instructions and refuse unethical or harmful requests. By testing prominent models on this benchmark, the researchers were able to gain insights into the current state of these capabilities and identify areas for improvement.

The development of RefuteBench is a significant step forward in ensuring the responsible development and deployment of powerful AI systems. As language models continue to become more capable, it is crucial that they are able to understand and carry out instructions, while also maintaining robust ethical reasoning to protect against misuse.

The findings from this research can help guide the field towards more trustworthy and beneficial AI assistants that can be reliably deployed in real-world applications. The continued advancement of benchmarks like RefuteBench will be essential for driving progress in this important area of AI safety and reliability.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models

Jianhao Yan, Yun Luo, Yue Zhang

The application scope of large language models (LLMs) is increasingly expanding. In practical use, users might provide feedback based on the model's output, hoping for a responsive model that can complete responses according to their feedback. Whether the model can appropriately respond to users' refuting feedback and consistently follow through with execution has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn, i.e. exhibit inclination to their internal knowledge, often failing to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user's stated feedback and roll back to their own responses. We further propose a recall-and-repeat prompts as a simple and effective way to enhance the model's responsiveness to feedback.

7/25/2024

💬

Evaluating Large Language Models at Evaluating Instruction Following

Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen

As research in large language models (LLMs) continues to accelerate, LLM-based evaluation has emerged as a scalable and cost-effective alternative to human evaluations for comparing the ever increasing list of models. This paper investigates the efficacy of these ``LLM evaluators'', particularly in using them to assess instruction following, a metric that gauges how closely generated text adheres to the given instruction. We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. The authors manually curated 419 pairs of outputs, one adhering to instructions while the other diverging, yet may possess deceptive qualities that mislead an LLM evaluator, e.g., a more engaging tone. Contrary to existing meta-evaluation, we discover that different evaluators (i.e., combinations of LLMs and prompts) exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement. We also present a novel suite of prompting strategies that further close the gap between LLM and human evaluators. With LLMBar, we hope to offer more insight into LLM evaluators and foster future research in developing better instruction-following models.

4/17/2024

💬

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models

Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, Wei Wang

The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response follows constraints stated in the instruction. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. To assess whether LLMs' outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions. By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.

6/6/2024

Beyond Instruction Following: Evaluating Rule Following of Large Language Models

Wangtao Sun, Chenxiang Zhang, Xueyou Zhang, Ziyang Huang, Haotian Xu, Pei Chen, Shizhu He, Jun Zhao, Kang Liu

Although Large Language Models (LLMs) have demonstrated strong instruction-following ability, they are further supposed to be controlled and guided by rules in real-world scenarios to be safe, accurate, and intelligent. This demands the possession of inferential rule-following capability of LLMs. However, few works have made a clear evaluation of the inferential rule-following capability of LLMs. Previous studies that try to evaluate the inferential rule-following capability of LLMs fail to distinguish the inferential rule-following scenarios from the instruction-following scenarios. Therefore, this paper first clarifies the concept of inferential rule-following and proposes a comprehensive benchmark, RuleBench, to evaluate a diversified range of inferential rule-following abilities. Our experimental results on a variety of LLMs show that they are still limited in following rules. Our analysis based on the evaluation results provides insights into the improvements for LLMs toward a better inferential rule-following intelligent agent. We further propose Inferential Rule-Following Tuning (IRFT), which outperforms IFT in helping LLMs solve RuleBench. The data and code can be found at: https://anonymous.4open.science/r/llm-rule-following-B3E3/

8/20/2024