Reasons to Reject? Aligning Language Models with Judgments

Read original: arXiv:2312.14591 - Published 6/7/2024 by Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi

Overview

This paper investigates how to align large language models (LLMs) with judgments, that is, natural-language feedback on their outputs, rather than with scalar rewards.
The researchers first examine existing methods that could be adapted to learn from judgments and find that these methods cannot fully capitalize on the feedback.
They then propose Contrastive Unlikelihood Training (CUT), a framework that uses judgments for fine-grained detection and correction of inappropriate content.

Plain English Explanation

Large language models like GPT-3 have become incredibly capable at generating human-like text. However, these models don't always align with human preferences and values. The researchers in this paper explore ways to "train" language models to better match human judgments.

The key idea is to learn from written feedback on a model's outputs. People routinely receive feedback from their peers in natural language and use it to correct their mistakes. In the same way, a judgment explains in words what is wrong with a response instead of only assigning it a score. The researchers fine-tune the model on this feedback so that it learns to avoid the problems the judgments point out.

This type of alignment is important for a few reasons. First, it can help mitigate issues like linguistic discrimination - where language models exhibit biases against certain demographic groups. Second, it allows the models to be used more effectively in applications like generating educational content, where the output needs to be tailored to human needs. And more broadly, it's a crucial step in understanding how to align these powerful AI systems with human values and preferences.

Technical Explanation

The paper contrasts two kinds of feedback on language model outputs. Previous research mostly aligns LLMs with scalar rewards, such as a numeric score for a response. Judgments are instead language feedback that describes what is inappropriate about a response, and the paper presents what the authors describe as the first systematic exploration of alignment from this kind of feedback.

The researchers first investigate existing methods that can be adapted to align LLMs with judgments and find that these methods cannot fully capitalize on the feedback. To use judgments more effectively, they propose Contrastive Unlikelihood Training (CUT), a framework for fine-grained detection and correction of inappropriate content based on judgments. They also apply CUT iteratively, fine-tuning on up-to-date judgments of the model's own outputs.
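The paper's abstract names Contrastive Unlikelihood Training (CUT) as its fine-tuning framework but this summary does not spell out the objective. The sketch below illustrates the general unlikelihood idea the name suggests, under stated assumptions: ordinary tokens are trained with the usual negative log-likelihood, while tokens flagged as inappropriate are penalized with -log(1 - p). The function names, the `ratio` threshold, and the contrast rule in `flag_tokens` are hypothetical choices made for illustration and should not be read as the authors' exact method.

```python
import math


def flag_tokens(probs_with_judgment, probs_without_judgment, ratio=1.2):
    """Hypothetical detection rule (an assumption, not taken from the paper).

    A token is flagged as inappropriate when its probability rises noticeably
    once the model is also conditioned on a negative judgment.
    """
    if len(probs_with_judgment) != len(probs_without_judgment):
        raise ValueError("both probability lists must have the same length")
    return [
        p_j > ratio * p
        for p_j, p in zip(probs_with_judgment, probs_without_judgment)
    ]


def unlikelihood_style_loss(token_probs, flagged, eps=1e-8):
    """Average per-token loss over one response.

    token_probs: model probability of each response token, in (0, 1).
    flagged: booleans, True where the token is treated as inappropriate.
    """
    if not token_probs or len(token_probs) != len(flagged):
        raise ValueError("need non-empty inputs of equal length")
    total = 0.0
    for p, bad in zip(token_probs, flagged):
        if bad:
            # Unlikelihood term: push the probability of a flagged token down.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Likelihood term: keep the probability of other tokens high.
            total += -math.log(max(p, eps))
    return total / len(token_probs)


if __name__ == "__main__":
    # Toy numbers, invented for illustration only.
    without_judgment = [0.9, 0.2, 0.3]
    with_judgment = [0.9, 0.2, 0.6]
    flags = flag_tokens(with_judgment, without_judgment)
    print(flags, unlikelihood_style_loss(without_judgment, flags))
```

In this toy setup only the third token is flagged, so lowering its probability reduces the loss while the other two tokens are still rewarded for being likely. A real implementation would compute the probabilities from model logits and backpropagate through the loss.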

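According to the abstract, with only 1317 off-the-shelf judgment examples, CUT applied to LLaMA2-13b beats the 175B DaVinci003 and surpasses the best baseline by 50.84 points on AlpacaEval. Applied iteratively to LLaMA2-chat-13b with model-specific judgments, CUT improves performance from 81.09 to 91.68 points on AlpacaEval. The authors' further analysis suggests that judgments hold greater potential than rewards for LLM alignment.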

Critical Analysis

The paper presents a thoughtful and rigorous exploration of the challenge of aligning language models with human judgments and preferences. The researchers utilize a variety of techniques for collecting feedback and fine-tuning the models, which is a valuable contribution to the field.

That said, the paper acknowledges some key limitations and areas for further research. For example, the feedback collection process relies on a relatively small set of human raters, which may not fully capture the diversity of human perspectives. There are also open questions about the scalability and robustness of the fine-tuning approaches, especially as the models grow larger and more complex.

Additionally, while the paper touches on issues of bias and fairness, there may be deeper societal and ethical considerations that warrant further examination. Evaluating and mitigating linguistic discrimination in large language models is an area that deserves continued scrutiny and research.

Overall, this paper represents an important step forward in understanding the learning dynamics of aligning language models with human feedback. The insights and methods presented here could help pave the way for more robust and socially responsible language AI systems.

Conclusion

This paper explores innovative approaches for aligning large language models with human judgments and preferences. By collecting feedback from people on model outputs and using that to fine-tune the models, the researchers demonstrate promising techniques for imbuing these powerful AI systems with greater alignment to human values.

The work has important implications for mitigating issues like linguistic bias, optimizing language models for educational and other high-stakes applications, and more broadly, understanding how to create AI systems that reliably reflect human preferences. As language models continue to advance, this type of research will be crucial for ensuring they are developed and deployed in a responsible and socially-conscious manner.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Reasons to Reject? Aligning Language Models with Judgments

Weiwen Xu, Deng Cai, Zhisong Zhang, Wai Lam, Shuming Shi

As humans, we consistently interact with our peers and receive feedback in the form of natural language. This language feedback allows us to maintain appropriate behavior, and rectify potential errors. The question arises naturally: can we use language feedback to align large language models (LLMs)? In contrast to previous research that aligns LLMs with scalar rewards, we present the first systematic exploration of alignment through the lens of language feedback (i.e., judgment). We start with an in-depth investigation of potential methods that can be adapted for aligning LLMs with judgments, revealing that these methods cannot fully capitalize on judgments. To facilitate more effective utilization of judgments, we propose a novel framework, Contrastive Unlikelihood Training (CUT), that allows for fine-grained inappropriate content detection and correction based on judgments. Our results show that, with merely 1317 off-the-shelf judgment data, CUT (LLaMA2-13b) can beat the 175B DaVinci003 and surpass the best baseline by 50.84 points on AlpacaEval. CUT (LLaMA2-chat-13b) can also align LLMs in an iterative fashion using up-to-date model-specific judgments, improving performance from 81.09 to 91.68 points on AlpacaEval. Further analysis suggests that judgments hold greater potential than rewards in LLM alignment.

6/7/2024

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge

Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, Sainbayar Sukhbaatar

Large Language Models (LLMs) are rapidly surpassing human knowledge in many domains. While improving these models traditionally relies on costly human data, recent self-rewarding mechanisms (Yuan et al., 2024) have shown that LLMs can improve by judging their own responses instead of relying on human labelers. However, existing methods have primarily focused on improving model responses rather than judgment capabilities, resulting in rapid saturation during iterative training. To address this issue, we introduce a novel Meta-Rewarding step to the self-improvement process, where the model judges its own judgements and uses that feedback to refine its judgment skills. Surprisingly, this unsupervised approach improves the model's ability to judge and follow instructions, as demonstrated by a win rate improvement of Llama-3-8B-Instruct from 22.9% to 39.4% on AlpacaEval 2, and 20.6% to 29.1% on Arena-Hard. These results strongly suggest the potential for self-improving models without human supervision.

7/31/2024

Self-Judge: Selective Instruction Following with Alignment Self-Evaluation

Hai Ye, Hwee Tou Ng

Pre-trained large language models (LLMs) can be tailored to adhere to human instructions through instruction tuning. However, due to shifts in the distribution of test-time data, they may not always execute instructions accurately, potentially generating factual errors or misaligned content when acting as chat assistants. To enhance the reliability of LLMs in following instructions, we propose the study of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low. We train judge models that can predict numerical quality scores for model responses. To address data scarcity, we introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores. Our method leverages the model's inherent self-evaluation capability to extract information about response quality from labeled instruction-tuning data. It incorporates a gold reference answer to facilitate self-evaluation and recalibrates by assessing the semantic similarity between the response sample and the gold reference. During the training phase, we implement self-distillation as a regularization technique to enhance the capability of reference-free estimation. To validate alignment evaluation on general instruction-following tasks, we collect large-scale high-quality instructions from Hugging Face for model training and evaluation. Extensive experiments on five open-source models show that our method correlates much more with GPT-4 than strong baselines, e.g., supervised models distilled from GPT-4 and GPT-3.5-turbo. Our analysis shows our model's strong generalization across domains. Additionally, our judge models serve as good reward models, e.g., boosting WizardLM-13B-V1.2 from 89.17 to 92.48 and from 12.03 to 15.90 in version v1 and v2 of AlpacaEval respectively using best-of-32 sampling with our judge models.

9/4/2024

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.

6/19/2024