Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following

Read original: arXiv:2404.15190 - Published 4/24/2024 by Suyeon Shin, Sujin Jeon, Junghyun Kim, Gi-Cheon Kang, Byoung-Tak Zhang

Overview

The paper introduces the Socratic Planner, a novel approach to addressing the challenge of compositional task planning in Embodied Instruction Following (EIF).
EIF is the task of executing natural language instructions by navigating and interacting with objects in 3D environments.
The Socratic Planner is a zero-shot planning method that infers a high-level plan, i.e., a sequence of subgoals, without the need for any training data.
The paper also introduces an evaluation metric called RelaxedHLP for more comprehensive assessment of high-level plans.

Plain English Explanation

The paper focuses on the challenge of Embodied Instruction Following (EIF), which is the task of following natural language instructions by navigating and interacting with objects in 3D virtual environments. One of the key challenges in EIF is compositional task planning, which is often addressed using supervised or in-context learning with labeled data.

To avoid this reliance on labeled data, the researchers introduce the Socratic Planner, an approach that infers a high-level plan without any training data. The Socratic Planner first breaks the instruction down into smaller, more manageable pieces of information about the task. It does this by asking itself questions and answering them, much like the Socratic method of teaching, and then turns the answers into a sequence of subgoals.

Once the subgoals are identified, the Socratic Planner executes them sequentially, dynamically adjusting the plan based on the dense visual feedback from the environment. This allows the system to stay flexible and adapt to changes in the 3D space.

The paper also introduces a new evaluation metric called RelaxedHLP, which the researchers claim provides a more comprehensive assessment of the high-level plans generated by the Socratic Planner and other EIF systems.

Technical Explanation

The Socratic Planner proposed in this paper is a zero-shot planning method that can infer a high-level plan, i.e., a sequence of subgoals, without any prior training data. This is a significant departure from the more common supervised or in-context learning approaches to compositional task planning in Embodied Instruction Following (EIF).

The key innovation of the Socratic Planner is its self-questioning and answering mechanism, which allows it to decompose the instructions into substructural information of the task. This substructural information is then translated into a high-level plan, which is executed sequentially.
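The decomposition step can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the question list, the prompt wording, and the `ask` callable (a stand-in for any language model call) are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical question templates; the paper's actual prompts are not given here.
QUESTIONS = [
    "Which objects does the task involve?",
    "What must happen to each object?",
    "Where should each object end up?",
]


def self_question(instruction: str, ask: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Collect question/answer pairs about the instruction."""
    qa_pairs: List[Tuple[str, str]] = []
    for question in QUESTIONS:
        # Earlier answers are included so later questions can build on them.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
        prompt = f"Instruction: {instruction}\n{context}\nQ: {question}\nA:"
        qa_pairs.append((question, ask(prompt).strip()))
    return qa_pairs


def to_plan(
    instruction: str,
    qa_pairs: List[Tuple[str, str]],
    ask: Callable[[str], str],
) -> List[str]:
    """Translate the collected answers into a sequence of subgoals."""
    facts = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = f"Instruction: {instruction}\n{facts}\nList the subgoals, one per line:"
    reply = ask(prompt)
    # Keep non-empty lines only; each line is treated as one subgoal.
    return [line.strip() for line in reply.splitlines() if line.strip()]
```

Passing the model in as a plain callable keeps the sketch runnable with a stub and makes no claim about which model or API the paper uses.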

During plan execution, the Socratic Planner incorporates dense visual feedback from the environment to dynamically adjust the plan, ensuring that it remains aligned with the current state of the 3D space. This visually grounded re-planning mechanism is a crucial component that enables the system to handle changes and unexpected events during the task execution.
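A rough sketch of such an execute-and-revise loop follows. It is a simplification built on assumptions: it re-plans only when a subgoal fails, whereas the paper describes dense feedback, and the `execute`, `observe`, and `replan` callables and the `max_replans` limit are hypothetical placeholders rather than parts of the published system.

```python
from typing import Callable, List


def run_with_replanning(
    plan: List[str],
    execute: Callable[[str], bool],
    observe: Callable[[], str],
    replan: Callable[[List[str], str], List[str]],
    max_replans: int = 3,
) -> List[str]:
    """Execute subgoals in order and revise the remaining plan on failure.

    Returns the subgoals that completed successfully.
    """
    remaining = list(plan)
    done: List[str] = []
    replans = 0
    while remaining:
        subgoal = remaining[0]
        if execute(subgoal):
            done.append(remaining.pop(0))
            continue
        if replans >= max_replans:
            break  # give up rather than loop forever
        # Assumption: feedback is a text description of the current view.
        feedback = observe()
        remaining = replan(remaining, feedback)
        replans += 1
    return done
```

For example, if the plan asks for a mug but the feedback reports only a cup, a `replan` function could swap the first subgoal for one that targets the cup and leave the rest of the plan unchanged.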

To evaluate the performance of the Socratic Planner and other EIF systems, the paper introduces a new metric called RelaxedHLP. This metric provides a more comprehensive assessment of high-level plans, going beyond the traditional metrics that focus solely on task completion.
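This summary does not say how RelaxedHLP is computed, so the following is not the paper's metric. It only illustrates, under a hypothetical scoring rule, how a relaxed comparison of high-level plans can differ from a strict one: here a predicted plan counts as correct if it equals any plan in a set of acceptable references, rather than a single annotated plan. All function names and the subgoal strings are assumptions.

```python
from typing import List, Sequence

Plan = Sequence[str]


def strict_match(predicted: Plan, reference: Plan) -> bool:
    """Correct only if the plan equals the single annotated reference."""
    return list(predicted) == list(reference)


def relaxed_match(predicted: Plan, acceptable: List[Plan]) -> bool:
    """Correct if the plan equals any plan in a set of acceptable references."""
    return any(list(predicted) == list(ref) for ref in acceptable)


def accuracy(flags: List[bool]) -> float:
    """Fraction of plans judged correct; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0
```

Under this rule, a plan that handles two interchangeable objects in the opposite order from the annotation fails the strict check but passes the relaxed one, provided both orderings are listed as acceptable.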

Critical Analysis

The Socratic Planner represents an interesting and novel approach to addressing the challenge of compositional task planning in Embodied Instruction Following (EIF). The key strength of the system is its ability to infer a high-level plan without any prior training data, which could make it a valuable tool for real-world applications where labeled data is scarce or expensive to obtain.

However, the paper does not delve into the specific details of how the self-questioning and answering mechanism works, nor does it provide a detailed comparison of the Socratic Planner's performance against other state-of-the-art EIF systems. Additionally, the paper does not address the potential limitations or scalability concerns of the Socratic Planner, such as how it might perform on more complex or open-ended instructions.

Further research and evaluation would be needed to fully assess the Socratic Planner's strengths, weaknesses, and potential for real-world applications. It would also be interesting to see how the Socratic Planner's visually grounded re-planning mechanism compares to other approaches to task execution and dynamic plan adjustment in EIF.

Conclusion

The Socratic Planner introduced in this paper represents a novel approach to addressing the challenge of compositional task planning in Embodied Instruction Following (EIF). By leveraging a self-questioning and answering mechanism, the Socratic Planner can infer a high-level plan without the need for any training data, a significant departure from the more common supervised or in-context learning methods.

The paper also introduces a new evaluation metric, RelaxedHLP, which provides a more comprehensive assessment of high-level plans in EIF. While the Socratic Planner shows promising results, further research and evaluation are needed to fully understand its strengths, limitations, and potential real-world applications.

Overall, this paper makes a valuable contribution to the field of EIF by proposing a novel planning approach and introducing a new evaluation metric, both of which could help drive further advancements in this important area of research.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following

Suyeon Shin, Sujin Jeon, Junghyun Kim, Gi-Cheon Kang, Byoung-Tak Zhang

Embodied Instruction Following (EIF) is the task of executing natural language instructions by navigating and interacting with objects in 3D environments. One of the primary challenges in EIF is compositional task planning, which is often addressed with supervised or in-context learning with labeled data. To this end, we introduce the Socratic Planner, the first zero-shot planning method that infers a plan without the need for any training data. Socratic Planner first decomposes the instructions into substructural information of the task through self-questioning and answering, translating it into a high-level plan, i.e., a sequence of subgoals. Subgoals are executed sequentially, with our visually grounded re-planning mechanism adjusting plans dynamically through dense visual feedback. We also introduce an evaluation metric of high-level plans, RelaxedHLP, for a more comprehensive evaluation. Experiments demonstrate the effectiveness of the Socratic Planner, achieving competitive performance on both zero-shot and few-shot task planning in the ALFRED benchmark, particularly excelling in tasks requiring higher-dimensional inference. Additionally, precise adjustments to the plan were achieved by incorporating environmental visual information.

4/24/2024

Embodied Instruction Following in Unknown Environments

Zhenyu Wu, Ziwei Wang, Xiuwei Xu, Jiwen Lu, Haibin Yan

Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.

6/18/2024

HAPFI: History-Aware Planning based on Fused Information

Sujin Jeon, Suyeon Shin, Byoung-Tak Zhang

Embodied Instruction Following (EIF) is a task of planning a long sequence of sub-goals given high-level natural language instructions, such as "Rinse a slice of lettuce and place on the white table next to the fork." To successfully execute these long-horizon tasks, we argue that an agent must consider its past, i.e., historical data, when making decisions in each step. Nevertheless, recent approaches in EIF often neglect the knowledge from historical data and also do not effectively utilize information across the modalities. To this end, we propose History-Aware Planning based on Fused Information (HAPFI), effectively leveraging the historical data from diverse modalities that agents collect while interacting with the environment. Specifically, HAPFI integrates multiple modalities, including historical RGB observations, bounding boxes, sub-goals, and high-level instructions, by effectively fusing modalities via our Mutually Attentive Fusion method. Through experiments with diverse comparisons, we show that an agent utilizing historical multi-modal information surpasses all the compared methods that neglect the historical data in terms of action planning capability, enabling the generation of well-informed action plans for the next step. Moreover, we provide qualitative evidence highlighting the significance of leveraging historical multi-modal data, particularly in scenarios where the agent encounters intermediate failures, showcasing its robust re-planning capabilities.

7/24/2024

Semantic Skill Grounding for Embodied Instruction-Following in Cross-Domain Environments

Sangwoo Shin, Seunghyun Kim, Youngsoo Jang, Moontae Lee, Honguk Woo

In embodied instruction-following (EIF), the integration of pretrained language models (LMs) as task planners emerges as a significant branch, where tasks are planned at the skill level by prompting LMs with pretrained skills and user instructions. However, grounding these pretrained skills in different domains remains challenging due to their intricate entanglement with the domain-specific knowledge. To address this challenge, we present a semantic skill grounding (SemGro) framework that leverages the hierarchical nature of semantic skills. SemGro recognizes the broad spectrum of these skills, ranging from short-horizon low-semantic skills that are universally applicable across domains to long-horizon rich-semantic skills that are highly specialized and tailored for particular domains. The framework employs an iterative skill decomposition approach, starting from the higher levels of semantic skill hierarchy and then moving downwards, so as to ground each planned skill to an executable level within the target domain. To do so, we use the reasoning capabilities of LMs for composing and decomposing semantic skills, as well as their multi-modal extension for assessing the skill feasibility in the target domain. Our experiments in the VirtualHome benchmark show the efficacy of SemGro in 300 cross-domain EIF scenarios.

8/22/2024