Policy Improvement using Language Feedback Models

2402.07876

Published 4/22/2024 by Victor Zhong, Dipendra Misra, Xingdi Yuan, Marc-Alexandre Côté

Abstract

We introduce Language Feedback Models (LFMs) that identify desirable behaviour - actions that help achieve tasks specified in the instruction - for imitation learning in instruction following. To train LFMs, we obtain feedback from Large Language Models (LLMs) on visual trajectories verbalized to language descriptions. First, by using LFMs to identify desirable behaviour to imitate, we improve in task-completion rate over strong behavioural cloning baselines on three distinct language grounding environments (Touchdown, ScienceWorld, and ALFWorld). Second, LFMs outperform using LLMs as experts to directly predict actions, when controlling for the number of LLM output tokens. Third, LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation. Finally, LFM can be modified to provide human-interpretable feedback without performance loss, allowing human verification of desirable behaviour for imitation learning.

Overview

This research paper explores an approach to improving instruction-following policies that relies on feedback from Large Language Models (LLMs) rather than on feedback from human users.
The proposed method, called "Policy Improvement using Language Feedback Models," trains a Language Feedback Model (LFM) on LLM feedback about verbalized trajectories, then uses the LFM to identify desirable behaviour for the policy to imitate.
The paper reports experiments on three language grounding environments (Touchdown, ScienceWorld, and ALFWorld) in which this approach improves task-completion rate over strong behavioural cloning baselines.

Plain English Explanation

Imagine you're trying to teach a robot how to do a complex task, like making a sandwich. You could show it step-by-step instructions, but the robot might still struggle to understand the nuances and complete the task correctly.

This research paper proposes a new way to address this issue. Instead of relying only on instructions, the approach asks a large language model to review a text description of what the robot did and point out which steps actually helped. For example, the feedback might say "Adding the lettuce was a useful step, but opening the oven was not."

The researchers use this feedback to train a separate Language Feedback Model that can make the same kind of judgement on its own. The robot's policy (its "brain") then imitates only the steps that the feedback model marks as helpful, which improves its ability to follow the instructions accurately.

The feedback model can also be modified to explain its judgements in plain language without losing performance. This lets a person check why a step was judged helpful before the robot learns to imitate it.

Through experiments on three environments (Touchdown, ScienceWorld, and ALFWorld), the researchers showed that this "Policy Improvement using Language Feedback Models" approach improves task-completion rate over strong behavioural cloning baselines, and that one round of adaptation improves task-completion rate by 3.5-12.0% in unseen environments. This could have important applications in fields like robotics, task automation, and even education, where the ability to follow complex instructions is crucial.

Technical Explanation

The paper proposes a framework called "Policy Improvement using Language Feedback Models" (PIFM) that uses feedback from LLMs to select which behaviour an instruction-following policy should imitate.

The core idea is to train a "Language Feedback Model" (LFM) that identifies desirable behaviour, meaning actions that help achieve the task specified in the instruction. To train the LFM, the authors obtain feedback from LLMs on visual trajectories that have been verbalized into language descriptions. The trained LFM is then used to pick out the behaviour the policy should imitate.
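As the abstract notes, training an LFM starts from LLM feedback on trajectories verbalized into language descriptions. The following Python sketch illustrates one way that data-collection step could look. The prompt wording, the reply format, and the function names `verbalize` and `parse_feedback` are assumptions made for illustration and are not taken from the paper; the LLM reply is a hard-coded string rather than a real model call.

```python
import re
from typing import List, Tuple


def verbalize(instruction: str, steps: List[Tuple[str, str]]) -> str:
    """Turn an instruction and (observation, action) pairs into a text prompt.

    The prompt layout is hypothetical; the paper's actual format may differ.
    """
    lines = [f"Task: {instruction}"]
    for i, (observation, action) in enumerate(steps, start=1):
        lines.append(f"Step {i}. Observation: {observation} Action: {action}")
    lines.append("Which steps were helpful for the task? Answer with step numbers.")
    return "\n".join(lines)


def parse_feedback(reply: str, num_steps: int) -> List[bool]:
    """Convert a reply such as 'Steps 2 and 3' into one boolean label per step."""
    chosen = {int(n) for n in re.findall(r"\d+", reply)}
    return [i in chosen for i in range(1, num_steps + 1)]


steps = [
    ("You are in the kitchen.", "open fridge"),
    ("You see a mug on the table.", "take mug"),
    ("You are holding the mug.", "go to sink"),
]
prompt = verbalize("put the mug in the sink", steps)

# Assumed LLM reply, hard-coded here in place of a real model call.
llm_reply = "Steps 2 and 3 were helpful."
labels = parse_feedback(llm_reply, len(steps))
```

Each (step, label) pair produced this way could serve as a training example for a feedback model, so that the LLM does not need to be queried again once the feedback model is trained.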

Specifically, the policy is improved through imitation learning on the behaviour that the LFM marks as desirable, rather than through feedback collected from human annotators. The authors also compare this against using LLMs as experts that directly predict actions, and report that LFMs perform better when the number of LLM output tokens is held equal.
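According to the abstract, the LFM is used to identify desirable behaviour, which the policy then imitates. A minimal Python sketch of that selection step is shown below. The `Step` type, the `select_desirable_steps` helper, and the keyword-matching `toy_feedback_model` are hypothetical stand-ins chosen for illustration; a real LFM would be a trained model, and the paper's implementation may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Step:
    """One verbalized step of a rollout (fields are illustrative)."""
    observation: str  # language description of what the agent sees
    action: str       # action the policy took at this step


# Hypothetical interface: returns True when the action looks helpful
# for achieving the instruction.
FeedbackModel = Callable[[str, Step], bool]


def select_desirable_steps(
    instruction: str, rollout: List[Step], feedback_model: FeedbackModel
) -> List[Tuple[str, str, str]]:
    """Keep only the steps the feedback model approves of.

    Each kept step becomes an (instruction, observation, action) example
    that an imitation learner could be trained on.
    """
    return [
        (instruction, step.observation, step.action)
        for step in rollout
        if feedback_model(instruction, step)
    ]


def toy_feedback_model(instruction: str, step: Step) -> bool:
    """Stand-in for a trained LFM: approves actions sharing a word with the instruction."""
    instruction_words = set(instruction.lower().split())
    return any(word in instruction_words for word in step.action.lower().split())


instruction = "put the mug in the sink"
rollout = [
    Step("You are in the kitchen.", "open fridge"),
    Step("You see a mug on the table.", "take mug"),
    Step("You are holding the mug.", "go to sink"),
]
imitation_data = select_desirable_steps(instruction, rollout, toy_feedback_model)
```

In this toy run the "open fridge" step is filtered out and the other two steps are kept as imitation examples. Passing the feedback model as a plain callable keeps the selection logic independent of how the model is implemented.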

The researchers evaluate their approach on three language grounding environments (Touchdown, ScienceWorld, and ALFWorld), reporting improved task-completion rates over strong behavioural cloning baselines. They also report that LFMs generalize to unseen environments, improving task-completion rate by 3.5-12.0% through one round of adaptation, and that the LFM can be modified to provide human-interpretable feedback without performance loss.

Critical Analysis

The PIFM approach presented in this paper is a promising step towards enhancing instruction-following policies. By using LLM feedback to decide which behaviour is worth imitating, the system can improve a policy without asking an LLM to predict every action directly.

However, this summary cannot confirm how fully the paper explores the limitations of the approach. For example, the quality and consistency of the LLM feedback may vary, and the LFM's ability to accurately identify desirable behaviour is likely a critical factor in overall performance. It is also unclear how the system would scale to more complex, open-ended tasks that require more contextual understanding.

Further research could investigate the robustness of the PIFM approach to noisy or diverse feedback, as well as its applicability to a broader range of instruction following scenarios. Exploring ways to combine the language feedback with other forms of guidance, such as demonstrations or step-by-step instructions, could also be a fruitful area for future work.

Overall, this paper represents an important step towards more natural and effective instruction following by language models, but there is still much to be explored in terms of the limitations, scalability, and broader implications of this approach.

Conclusion

The "Policy Improvement using Language Feedback Models" framework presented in this paper offers a promising approach to improving instruction-following policies. By training a feedback model on LLM feedback and imitating the behaviour it identifies as desirable, the system can better execute complex instructions.

The experiments show improved task-completion rates over strong behavioural cloning baselines on Touchdown, ScienceWorld, and ALFWorld, suggesting that this approach could have important applications in fields like robotics, task automation, and education. However, further research is needed to explore the limitations, scalability, and potential combination of this technique with other guidance mechanisms.

As language models continue to advance, developing more effective methods for following instructions will be crucial for unlocking their full potential in real-world applications. The PIFM framework represents an important step in this direction, highlighting the value of language feedback that humans can inspect and verify.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
