Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Read original: arXiv:2402.11907 - Published 8/16/2024 by Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen

💬

Overview

Aligning large language models (LLMs) with human expectations is an important challenge.
This paper proposes a method to evaluate response preference using output probabilities of response pairs under contrastive prompt pairs.
The proposed method, Direct Large Model Alignment (DLMA), can achieve better performance than existing approaches on LLaMA2-7B and LLaMA2-13B models.
DLMA generates preference data automatically, evaluates it using contrastive prompts, and then aligns the LLM using a self-rewarding score and the DPO algorithm.

Plain English Explanation

The paper addresses the problem of aligning large language models with human expectations without relying on human-annotated preference data. This is an important challenge because we want these powerful language models to behave in ways that align with human values and preferences.

The researchers propose a new method called Direct Large Model Alignment (DLMA). The key idea is to use contrastive prompt pairs to automatically generate preference data, which can then be used to align the language model.

Specifically, DLMA first uses the contrastive prompt pairs to generate preference data. It then evaluates this data using the same contrastive prompt approach and calculates a "self-rewarding score." Finally, it uses an algorithm called DPO to effectively align the language model based on this self-rewarding score.

The researchers show that their DLMA method can outperform existing techniques, like RLHF, without requiring any human-annotated preference data. This is a significant advantage, as gathering high-quality human feedback can be time-consuming and expensive.

Technical Explanation

The paper proposes a method called Direct Large Model Alignment (DLMA) to align large language models (LLMs) with human expectations without relying on human-annotated preference data.

The key innovation is the use of contrastive prompt pairs to automatically generate preference data. The researchers hypothesize that the output probabilities of response pairs under these contrastive prompt pairs can be used to evaluate the response preference, which can then be leveraged to align the LLM.

The DLMA method consists of three main steps:

Preference Data Generation: The researchers use contrastive prompt pairs to automatically generate preference data, which captures the model's preferences between different responses.
Preference Data Evaluation: They then evaluate the generated preference data using the same contrastive prompt approach and calculate a "self-rewarding score" for each preference data point.
Model Alignment: Finally, the researchers use the DPO algorithm to effectively align the LLM by combining the self-rewarding scores.

The researchers evaluate their DLMA method on the LLaMA2-7B and LLaMA2-13B models and show that it can outperform the RLHF approach without requiring any human-annotated preference data.

Critical Analysis

The paper presents a novel and promising approach to aligning LLMs with human expectations without the need for human-annotated preference data. This is a significant advantage, as gathering high-quality human feedback can be challenging and resource-intensive.

However, the paper does not discuss any potential limitations or caveats of the DLMA method. For example, it would be helpful to understand how the performance of DLMA compares to methods that do use human-annotated data, or how robust the method is to different types of contrastive prompt pairs.

Additionally, the paper does not explore potential biases or issues that could arise from the automatic generation of preference data. It would be important to investigate whether the generated data accurately reflects human preferences or if it introduces any systematic biases.

Further research could also explore ways to combine the DLMA approach with human feedback, potentially leveraging the strengths of both methods to achieve even better alignment.

Conclusion

This paper proposes a novel method called Direct Large Model Alignment (DLMA) to align large language models with human expectations without relying on human-annotated preference data. The key innovation is the use of contrastive prompt pairs to automatically generate preference data, which is then used to align the language model.

The researchers show that their DLMA method can outperform existing techniques, such as RLHF, on the LLaMA2-7B and LLaMA2-13B models. This is a significant achievement, as it demonstrates the potential of automatically generated preference data to enable effective language model alignment without the need for extensive human feedback.

The paper's findings have important implications for the development of large language models that are aligned with human values and expectations. By reducing the reliance on human-annotated data, the DLMA method could make the alignment process more scalable and accessible, paving the way for more robust and trustworthy language models in the future.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Follow @aimodelsfyi on 𝕏 →

Related Papers

💬

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation

Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen

Aligning large language models (LLMs) with human expectations without human-annotated preference data is an important problem. In this paper, we propose a method to evaluate the response preference by using the output probabilities of response pairs under contrastive prompt pairs, which could achieve better performance on LLaMA2-7B and LLaMA2-13B compared to RLAIF. Based on this, we propose an automatic alignment method, Direct Large Model Alignment (DLMA). First, we use contrastive prompt pairs to automatically generate preference data. Then, we continue to evaluate the generated preference data using contrastive prompt pairs and calculate a self-rewarding score. Finally, we use the DPO algorithm to effectively align LLMs by combining this self-rewarding score. In the experimental stage, our DLMA method could surpass the texttt{RLHF} method without relying on human-annotated preference data.

8/16/2024

Aligning Large Language Models with Self-generated Preference Data

Dongyoung Kim, Kimin Lee, Jinwoo Shin, Jaehyung Kim

Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework that boosts the alignment of LLMs through Self-generated Preference data (Selfie) using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model's inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

6/10/2024

🤷

Aligning Crowd Feedback via Distributional Preference Reward Modeling

Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu

Deep Reinforcement Learning is widely used for aligning Large Language Models (LLM) with human preference. However, the conventional reward modelling is predominantly dependent on human annotations provided by a select cohort of individuals. Such dependence may unintentionally result in skewed models that reflect the inclinations of these annotators, thereby failing to adequately represent the wider population's expectations. We propose the Distributional Preference Reward Model (DPRM), a simple yet effective framework to align large language models with diverse human preferences. To this end, we characterize multiple preferences by a categorical distribution and introduce a Bayesian updater to accommodate shifted or new preferences. On top of that, we design an optimal-transportation-based loss to calibrate DPRM to align with the preference distribution. Finally, the expected reward is utilized to fine-tune an LLM policy to generate responses favoured by the population. Our experiments show that DPRM significantly enhances the alignment of LLMs with population preference, yielding more accurate, unbiased, and contextually appropriate responses.

5/31/2024

🤯

Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.

5/31/2024