Odyssey: Empowering Agents with Open-World Skills

Read original: arXiv:2407.15325 - Published 7/23/2024 by Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song


Overview

Researchers have been exploring the construction of generalist agents for open-world embodied environments like Minecraft.
Existing efforts have mainly focused on solving basic programmatic tasks, such as material collection and tool-crafting, treating the ObtainDiamond task as the ultimate goal.
This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch.
Discovering diverse gameplay opportunities in the open world becomes challenging due to this constraint.

Plain English Explanation

The paper introduces a new framework called ODYSSEY that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key components:

An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills.
A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki.
A new open-world benchmark that includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task.

The researchers demonstrate that the ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

Technical Explanation

The paper introduces the ODYSSEY framework, which aims to empower Large Language Model (LLM)-based agents with open-world skills to explore the Minecraft environment. The framework consists of three key components:

Open-World Skill Library: The agent is equipped with an interactive skill library that includes 40 primitive skills and 183 compositional skills, allowing for diverse gameplay opportunities.
Fine-Tuned LLaMA-3 Model: The researchers fine-tuned the LLaMA-3 model on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki, enabling the agent to understand and execute a wide range of tasks.
Open-World Benchmark: The paper introduces a new benchmark that includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task, designed to comprehensively evaluate the agent's planning and exploration capabilities.
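The distinction between primitive and compositional skills in the list above can be illustrated with a minimal Python sketch. It is not taken from the ODYSSEY code: the class SkillLibrary, the skill names, and the toy inventory dictionary are all hypothetical, chosen only to show how a compositional skill can be defined as an ordered sequence of other skills and resolved recursively down to primitives.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical illustration only: none of these names come from the ODYSSEY codebase.
World = Dict[str, int]  # toy inventory mapping item name -> count


@dataclass
class SkillLibrary:
    # Primitive skills act directly on the world state.
    primitives: Dict[str, Callable[[World], None]] = field(default_factory=dict)
    # Compositional skills are ordered lists of other skill names.
    compositions: Dict[str, List[str]] = field(default_factory=dict)

    def add_primitive(self, name: str, fn: Callable[[World], None]) -> None:
        self.primitives[name] = fn

    def add_composition(self, name: str, steps: List[str]) -> None:
        self.compositions[name] = list(steps)

    def run(self, name: str, world: World) -> None:
        """Execute a skill, expanding compositional skills recursively."""
        if name in self.primitives:
            self.primitives[name](world)
        elif name in self.compositions:
            for step in self.compositions[name]:
                self.run(step, world)
        else:
            raise KeyError(f"unknown skill: {name}")


def mine_log(world: World) -> None:
    world["log"] = world.get("log", 0) + 1


def craft_planks(world: World) -> None:
    # One log yields four planks in Minecraft.
    if world.get("log", 0) < 1:
        raise RuntimeError("craft_planks needs at least one log")
    world["log"] -= 1
    world["planks"] = world.get("planks", 0) + 4


library = SkillLibrary()
library.add_primitive("mine_log", mine_log)
library.add_primitive("craft_planks", craft_planks)
# A compositional skill built from primitives...
library.add_composition("obtain_planks", ["mine_log", "craft_planks"])
# ...and one built from another compositional skill.
library.add_composition("obtain_eight_planks", ["obtain_planks", "obtain_planks"])

world: World = {}
library.run("obtain_eight_planks", world)
print(world)  # {'log': 0, 'planks': 8}
```

In this sketch, new behaviours are added by registering a short list of existing skill names rather than by writing new low-level control code, which is the general appeal of a layered skill library.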

Extensive experiments demonstrate that the ODYSSEY framework can effectively assess the abilities of autonomous agents in the Minecraft environment, paving the way for more advanced solutions.
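As a rough illustration of how results on such a benchmark might be summarized, the Python sketch below tallies a success rate per task category. Success rate is assumed here as the metric, since this summary does not state which metrics the paper reports. The TaskResult type, the category labels, the task names, and the sample outcomes are all made up for illustration and are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TaskResult:
    category: str  # hypothetical label, e.g. "long_term_planning"
    task: str      # hypothetical task name
    success: bool


def success_rate_by_category(results: Iterable[TaskResult]) -> Dict[str, float]:
    """Return the fraction of successful runs for each task category."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r.category] += 1
        wins[r.category] += int(r.success)
    return {category: wins[category] / totals[category] for category in totals}


# Made-up outcomes, used only to exercise the function.
results = [
    TaskResult("long_term_planning", "craft_sword", True),
    TaskResult("long_term_planning", "cook_meat", False),
    TaskResult("dynamic_immediate_planning", "combat_zombie", True),
]

rates = success_rate_by_category(results)
print(rates)  # {'long_term_planning': 0.5, 'dynamic_immediate_planning': 1.0}
```

Reporting each category separately keeps the handful of dynamic-immediate tasks from being swamped by the much larger pool of long-term planning tasks.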

Critical Analysis

The paper presents a promising approach to developing generalist agents for open-world environments like Minecraft. However, the researchers acknowledge that the current benchmark only covers a limited set of tasks and that further work is needed to expand the scope and complexity of the evaluation.

Additionally, the paper does not address potential limitations or ethical considerations that may arise from deploying such autonomous agents in real-world scenarios. As the agents become more capable, it will be crucial to consider the implications of their actions and ensure that they operate within appropriate ethical boundaries.

Future research could explore ways to enhance the agent's reasoning and decision-making processes, as well as investigate methods for skill discovery and composition that go beyond the current approach.

Conclusion

The ODYSSEY framework represents a significant step forward in the development of generalist agents for open-world embodied environments. By equipping LLM-based agents with a diverse skill library and a comprehensive benchmark, the researchers have laid the groundwork for more advanced autonomous agent solutions.

The public availability of the datasets, model weights, and code should make it easier for others to reproduce and extend this work, potentially leading to agents that can explore, understand, and interact with virtual worlds in increasingly sophisticated ways. As this technology continues to evolve, it will be crucial to consider the ethical implications and ensure that these agents are developed and deployed responsibly.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!


Related Papers


Odyssey: Empowering Agents with Open-World Skills

Shunyu Liu, Yaoru Li, Kongcheng Zhang, Zhenyu Cui, Wenkai Fang, Yuxuan Zheng, Tongya Zheng, Mingli Song

Recent studies have delved into constructing generalist agents for open-world embodied environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce ODYSSEY, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. ODYSSEY comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new open-world benchmark includes thousands of long-term planning tasks, tens of dynamic-immediate planning tasks, and one autonomous exploration task. Extensive experiments demonstrate that the proposed ODYSSEY framework can effectively evaluate the planning and exploration capabilities of agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

7/23/2024

See and Think: Embodied Agent in Virtual Environment

Zhonghan Zhao, Wenhao Chai, Xuan Wang, Li Boyi, Shengyu Hao, Shidong Cao, Tian Ye, Gaoang Wang

Large language models (LLMs) have achieved impressive progress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpreting visual information in the environment, which is then integrated into the LLMs component with agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. We conduct continuous block search, knowledge question and answering, and tech tree mastery to evaluate the performance. Extensive experiments show that STEVE achieves at most 1.5x faster unlocking key tech trees and 2.5x quicker in block search tasks.

7/10/2024


Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents

Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, Yitao Liang

We investigate the challenge of task planning for multi-task embodied agents in open-world environments. Two main difficulties are identified: 1) executing plans in an open-world environment (e.g., Minecraft) necessitates accurate and multi-step reasoning due to the long-term nature of tasks, and 2) as vanilla planners do not consider how easy the current agent can achieve a given sub-task when ordering parallel sub-goals within a complicated plan, the resulting plan could be inefficient or even infeasible. To this end, we propose Describe, Explain, Plan and Select (DEPS), an interactive planning approach based on Large Language Models (LLMs). DEPS facilitates better error correction on initial LLM-generated plan by integrating description of the plan execution process and providing self-explanation of feedback when encountering failures during the extended planning phases. Furthermore, it includes a goal selector, which is a trainable module that ranks parallel candidate sub-goals based on the estimated steps of completion, consequently refining the initial plan. Our experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks and nearly double the overall performances. Further testing reveals our method's general effectiveness in popularly adopted non-open-ended domains as well (i.e., ALFWorld and tabletop manipulation). The ablation and exploratory studies detail how our design beats the counterparts and provide a promising update on the ObtainDiamond grand challenge with our approach. The code is released at https://github.com/CraftJarvis/MC-Planner.

7/9/2024


STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Zhonghan Zhao, Wenhao Chai, Xuan Wang, Ke Ma, Kewei Chen, Dongxu Guo, Tian Ye, Yanting Zhang, Hongwei Wang, Gaoang Wang

Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of 2.5x to 7.3x. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world.

6/18/2024