Deep Bayesian Active Learning for Preference Modeling in Large Language Models

2406.10023

Published 6/17/2024 by Luckeciano C. Melo, Panagiotis Tigas, Alessandro Abate, Yarin Gal

Abstract

Leveraging human preferences for steering the behavior of Large Language Models (LLMs) has demonstrated notable success in recent years. Nonetheless, data selection and labeling are still a bottleneck for these systems, particularly at large scale. Hence, selecting the most informative points for acquiring human feedback may considerably reduce the cost of preference labeling and unleash the further development of LLMs. Bayesian Active Learning provides a principled framework for addressing this challenge and has demonstrated remarkable success in diverse settings. However, previous attempts to employ it for Preference Modeling did not meet such expectations. In this work, we identify that naive epistemic uncertainty estimation leads to the acquisition of redundant samples. We address this by proposing the Bayesian Active Learner for Preference Modeling (BAL-PM), a novel stochastic acquisition policy that not only targets points of high epistemic uncertainty according to the preference model but also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM. Notably, our experiments demonstrate that BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies.

Overview

This paper explores the use of deep Bayesian active learning to improve preference modeling in large language models (LLMs).
The researchers propose the Bayesian Active Learner for Preference Modeling (BAL-PM), a stochastic acquisition policy that chooses which data points to send for human preference labeling, with the aim of reducing labeling cost.
The policy combines epistemic uncertainty estimates from the preference model with an entropy term over the acquired prompt distribution in the LLM's feature space, so that the selected points are informative without being redundant.

Plain English Explanation

The paper tackles the challenge of getting large language models (LLMs) to better understand and align with human preferences. LLMs are powerful AI systems that can engage in natural language tasks, but they don't inherently know what humans value or prefer. The researchers propose using a technique called deep Bayesian active learning to address this.

The key idea is to be selective about which examples human annotators are asked to label. Bayesian active learning scores each candidate by how uncertain the preference model is about it (its epistemic uncertainty) and requests labels only for the most informative ones. The authors observe that doing this naively tends to pick redundant samples, so their method, BAL-PM, also favors prompts that spread out over the feature space of the LLM. As a result, the preference model can learn what matters to people without needing massive amounts of labeled training data.

This could help LLMs become more aligned with human values and preferences, making them better assistants and collaborators. Rather than blindly doing what the model thinks is best, it can actively seek to understand what humans actually want and then adjust its behavior accordingly.

Technical Explanation

The paper introduces a framework for deep Bayesian active learning for preference modeling in large language models. The core components are:

Bayesian Preference Model: The researchers use a Bayesian neural network to model human preferences, which lets the model represent epistemic uncertainty and update its beliefs as new labels arrive.
Active Learning: The model strategically selects which examples to send for human labeling, focusing on the most informative data points so that the preference model improves in a sample-efficient manner. Related work applies Bayesian optimization with LLM-based acquisition functions to natural language preference elicitation.
Prompt-Distribution Entropy: Because naive epistemic uncertainty estimation leads to the acquisition of redundant samples, BAL-PM also seeks to maximize the entropy of the acquired prompt distribution in the feature space spanned by the employed LLM.
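
To make the notion of epistemic uncertainty in the components above more concrete, here is a minimal sketch of one common estimator, the BALD score (mutual information between the predicted label and the model parameters). It assumes the Bayesian preference model is approximated by an ensemble whose members each output the probability that the first completion is preferred. The function names `binary_entropy` and `bald_scores` and the ensemble setup are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def bald_scores(member_probs):
    """BALD score per candidate pair (hypothetical helper, ensemble assumed).

    member_probs has shape (n_members, n_pairs); each entry is one ensemble
    member's probability that the first completion is preferred.
    """
    member_probs = np.asarray(member_probs, dtype=float)
    mean_prob = member_probs.mean(axis=0)
    total = binary_entropy(mean_prob)                      # entropy of the averaged prediction
    aleatoric = binary_entropy(member_probs).mean(axis=0)  # average entropy of each member
    return total - aleatoric                               # what remains is disagreement

# Two members, three candidate pairs.
scores = bald_scores([[0.5, 0.1, 0.9],
                      [0.5, 0.9, 0.9]])
# Only the second pair, where the members disagree, gets a clearly positive score.
print(scores)
```

The score is near zero both when the members agree that a pair is a coin flip and when they agree confidently, which is why it isolates model disagreement from label noise.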

According to the abstract, BAL-PM requires 33% to 68% fewer preference labels on two popular human preference datasets and exceeds previous stochastic Bayesian acquisition policies. Related lines of work include strengthening multimodal LLMs through bootstrapped preference learning and making better use of unlabeled data through Bayesian active learning.
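
The abstract describes an acquisition policy that balances epistemic uncertainty against the entropy of the acquired prompt distribution in the LLM's feature space. The sketch below illustrates that trade-off in a simplified form. It is not the authors' algorithm: it is a deterministic greedy loop (the paper's policy is stochastic), it uses the log distance to the nearest already-acquired prompt as a rough stand-in for an entropy gain, and the name `select_batch` and the weight `beta` are assumptions made for illustration.

```python
import numpy as np

def select_batch(uncertainty, features, acquired_features, batch_size, beta=1.0):
    """Greedy pick of candidates that are uncertain and far from acquired prompts.

    uncertainty: per-candidate epistemic uncertainty, shape (n,)
    features: per-candidate prompt features, shape (n, d)
    acquired_features: list of feature vectors already labeled (may be empty)
    beta: hypothetical weight on the diversity term
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    features = np.asarray(features, dtype=float)
    pool = [np.asarray(f, dtype=float) for f in acquired_features]
    chosen = []
    for _ in range(batch_size):
        if pool:
            ref = np.stack(pool)
            # Distance from every candidate to its nearest acquired prompt.
            diffs = features[:, None, :] - ref[None, :, :]
            nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
            diversity = np.log(nearest + 1e-8)
        else:
            diversity = np.zeros(len(features))
        score = uncertainty + beta * diversity
        if chosen:
            score[chosen] = -np.inf  # never pick the same candidate twice
        best = int(np.argmax(score))
        chosen.append(best)
        pool.append(features[best])
    return chosen

feats = [[0.0, 0.0], [0.0, 0.01], [5.0, 5.0]]
unc = [1.0, 0.99, 0.9]
print(select_batch(unc, feats, [], batch_size=2, beta=1.0))  # diversity-aware choice
print(select_batch(unc, feats, [], batch_size=2, beta=0.0))  # uncertainty only
```

With `beta=0.0` the loop picks the two most uncertain candidates even though their features are nearly identical, which is the redundancy problem the abstract attributes to naive uncertainty-based acquisition. With `beta=1.0` the second pick moves to the distant prompt instead.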

Critical Analysis

The paper presents a promising approach for improving the alignment of large language models with human preferences. However, there are a few limitations and areas for further research:

Scalability: The active acquisition approach may not scale well to scenarios with very large candidate pools or complex preference spaces. Developing more efficient preference elicitation strategies could be an important next step.
Robustness: The paper does not extensively explore the robustness of the preference modeling to noisy or adversarial user feedback. Understanding the vulnerabilities of the system and developing mitigation strategies would be valuable.
Ethical Considerations: While the goal of aligning LLMs with human values is laudable, there are important ethical questions around whose preferences should be prioritized and how to ensure fairness and inclusivity in the preference modeling process.

Overall, the paper presents a compelling approach to a crucial challenge in AI alignment, and the proposed framework merits further exploration and refinement.

Conclusion

This paper introduces a deep Bayesian active learning framework for improving preference modeling in large language models. By strategically selecting which examples to send for human preference labeling, the preference model can learn human values and preferences from fewer labels, helping to create LLMs that are better aligned with what people actually want.

While the approach has some limitations, it represents an important step towards developing AI systems that can meaningfully cooperate with humans and work towards shared goals. As language models become increasingly capable and influential, ensuring their values and behaviors are well-matched to human priorities will be critical. The techniques explored in this paper offer a promising direction for addressing this challenge.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Active Preference Learning for Large Language Models

William Muldrew, Peter Hayes, Mingtian Zhang, David Barber

As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.

7/1/2024

cs.LG cs.AI cs.CL

Active Preference Inference using Language Models and Probabilistic Reasoning

Wasu Top Piriyakulkij, Volodymyr Kuleshov, Kevin Ellis

Actively inferring user preferences, for example by asking good questions, is important for any human-facing decision-making system. Active inference allows such systems to adapt and personalize themselves to nuanced individual preferences. To enable this ability for instruction-tuned large language models (LLMs), one may prompt them to ask users questions to infer their preferences, transforming the language models into more robust, interactive systems. However, out of the box, these models are not efficient at extracting preferences: the questions they generate are not informative, requiring a high number of user interactions and impeding the usability of the downstream system. In this work, we introduce an inference-time algorithm that helps LLMs quickly infer preferences by using more informative questions. Our algorithm uses a probabilistic model whose conditional distributions are defined by prompting an LLM, and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm outperforms baselines with the same underlying LLM on task performance while using fewer user interactions.

6/27/2024

cs.CL cs.AI cs.LG

Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.

5/30/2024

cs.LG cs.AI

Bayesian Optimization with LLM-Based Acquisition Functions for Natural Language Preference Elicitation

David Eric Austin, Anton Korikov, Armin Toroghi, Scott Sanner

Designing preference elicitation (PE) methodologies that can quickly ascertain a user's top item preferences in a cold-start setting is a key challenge for building effective and personalized conversational recommendation (ConvRec) systems. While large language models (LLMs) constitute a novel technology that enables fully natural language (NL) PE dialogues, we hypothesize that monolithic LLM NL-PE approaches lack the multi-turn, decision-theoretic reasoning required to effectively balance the NL exploration and exploitation of user preferences towards an arbitrary item set. In contrast, traditional Bayesian optimization PE methods define theoretically optimal PE strategies, but fail to use NL item descriptions or generate NL queries, unrealistically assuming users can express preferences with direct item ratings and comparisons. To overcome the limitations of both approaches, we formulate NL-PE in a Bayesian Optimization (BO) framework that seeks to generate NL queries which actively elicit natural language feedback to reduce uncertainty over item utilities to identify the best recommendation. We demonstrate our framework in a novel NL-PE algorithm, PEBOL, which uses Natural Language Inference (NLI) between user preference utterances and NL item descriptions to maintain preference beliefs and BO strategies such as Thompson Sampling (TS) and Upper Confidence Bound (UCB) to guide LLM query generation. We numerically evaluate our methods in controlled experiments, finding that PEBOL achieves up to 131% improvement in MAP@10 after 10 turns of cold start NL-PE dialogue compared to monolithic GPT-3.5, despite relying on a much smaller 400M parameter NLI model for preference inference.

5/3/2024

cs.AI cs.CL