FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models






Published 4/17/2024 by Chao Tang, Dehao Huang, Wenlong Dong, Ruinian Xu, Hong Zhang


Task-oriented grasping (TOG), which refers to the problem of synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to the activation of two brain regions responsible for semantic and geometric reasoning during cognitive processes, modeling the complex relationship between objects, tasks, and grasps requires rich prior knowledge about objects and tasks. Existing methods typically limit the prior knowledge to a closed-set scope and cannot support the generalization to novel objects and tasks out of the training set. To address such a limitation, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge from foundation models to learn generalizable TOG skills. Comprehensive experiments are conducted on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset, demonstrating the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks out of the training set. Furthermore, the effectiveness of FoundationGrasp is validated in real-robot grasping and manipulation experiments on a 7 DoF robotic arm. Our code, data, appendix, and video are publicly available at



  • This paper introduces FoundationGrasp, a system that uses foundation models, such as large language models, to enable robots to perform generalizable, task-oriented grasping.
  • The key idea is to use the broad knowledge and language understanding of foundation models so that a robot grasps an object in a way suited to a specific task or goal, rather than only in a way that holds it securely.
  • The authors evaluate FoundationGrasp on the LaViA-TaskGrasp dataset and on a real 7 DoF robotic arm, reporting that it outperforms existing methods on novel object instances, object classes, and tasks.

Plain English Explanation

The paper presents FoundationGrasp, a new approach to robot grasping that uses large AI models (called "foundation models") to make grasping more versatile and task-oriented. Typical robot grasping systems focus mainly on holding an object as securely as possible. FoundationGrasp instead aims to choose a grasp that suits the specific task or goal.

The key insight is that foundation models, which are trained on vast amounts of text data, have developed a broad understanding of language and the world that can be leveraged for robot control. By incorporating this knowledge, FoundationGrasp can grasp objects in a more "intelligent" way that is adapted to the user's intent, rather than just trying to grip tightly.

For example, if the goal is to pour liquid from a container, FoundationGrasp might grasp the container in a way that leaves the pouring spout accessible, even if that doesn't result in the maximum grip strength. Or if the goal is to hand an object to a person, FoundationGrasp might orient the object in a natural hand-over position.

The authors show that FoundationGrasp outperforms prior grasping approaches on a variety of tasks, demonstrating the power of leveraging foundation models for more versatile and task-oriented robot control.

Technical Explanation

The paper introduces FoundationGrasp, a system that uses foundation models, such as large language models, to enable robots to perform more generalizable, task-oriented grasping. Task-agnostic grasping approaches have typically focused on grasp stability without considering the downstream task. Existing task-oriented methods, as the paper notes, limit their prior knowledge to a closed set, so they do not generalize to objects and tasks outside the training set.

In contrast, FoundationGrasp aims to grasp objects in a way that is tailored to the user's intent or the task at hand. The key insight is that foundation models, which are trained on vast amounts of text data, have developed a broad understanding of language, semantics, and common sense reasoning that can be leveraged for robot control.

FoundationGrasp incorporates this knowledge by conditioning the grasping policy on a language description of the task or goal. For example, if the goal is to "pour liquid from the container," FoundationGrasp would grasp the container in a way that leaves the pouring spout accessible, even if that doesn't result in the maximum grip strength.
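The idea of conditioning grasp selection on a task can be illustrated with a toy ranking function. This is a minimal sketch under stated assumptions, not the paper's architecture: the lookup table `PARTS_TO_KEEP_FREE` is a hypothetical stand-in for knowledge that a foundation model might supply, and the `GraspCandidate` fields, the part labels, and the `weight` parameter are all invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for foundation-model knowledge:
# which object parts should stay uncovered for a given task.
PARTS_TO_KEEP_FREE = {
    "pour": {"spout", "rim"},
    "handover": {"handle"},
    "scoop": {"bowl"},
}


@dataclass(frozen=True)
class GraspCandidate:
    name: str
    part: str         # object part the gripper would contact (assumed label)
    stability: float  # task-agnostic quality score in [0, 1] (assumed)


def task_compatibility(grasp: GraspCandidate, task: str) -> float:
    """Return 0.0 if the grasp covers a part the task needs, else 1.0."""
    blocked = PARTS_TO_KEEP_FREE.get(task, set())
    return 0.0 if grasp.part in blocked else 1.0


def rank_grasps(candidates, task, weight=0.7):
    """Sort candidates by a weighted mix of task compatibility and stability."""
    def score(g):
        return weight * task_compatibility(g, task) + (1 - weight) * g.stability
    return sorted(candidates, key=score, reverse=True)


candidates = [
    GraspCandidate("rim_pinch", part="rim", stability=0.9),
    GraspCandidate("body_wrap", part="body", stability=0.7),
    GraspCandidate("handle_grip", part="handle", stability=0.6),
]

# The most stable grasp (rim_pinch) is demoted for pouring because it covers the rim.
best_for_pour = rank_grasps(candidates, "pour")[0]
best_for_handover = rank_grasps(candidates, "handover")[0]
print(best_for_pour.name, best_for_handover.name)
```

With an unknown task the compatibility term is the same for every candidate, so the ranking falls back to stability alone. A learned system would replace the lookup table and the binary compatibility score with model outputs.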

The authors evaluate FoundationGrasp on a range of grasping tasks, including object handover, pouring, and tool use. They show that FoundationGrasp outperforms prior approaches, such as Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge, Learning Cross-Hand Policies for High-DOF Reaching, and CenterGrasp: Object-Aware Implicit Representation Learning for Simultaneous Grasp and Grasp-Type Prediction, which are more specialized and do not leverage the broad knowledge of foundation models.
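The abstract describes testing generalization to object instances, object classes, and tasks outside the training set. One common way to set up such an evaluation is a held-out split, sketched below. The `Sample` fields, the example records, and the `held_out_split` helper are hypothetical; this summary does not describe the actual split protocol used for LaViA-TaskGrasp.

```python
from collections import namedtuple

# Hypothetical annotation record: one labeled (object, task) grasp example.
Sample = namedtuple("Sample", ["instance", "object_class", "task", "label"])


def held_out_split(samples, field, held_out_values):
    """Put every sample whose `field` value is held out into the test set."""
    held = set(held_out_values)
    train = [s for s in samples if getattr(s, field) not in held]
    test = [s for s in samples if getattr(s, field) in held]
    return train, test


samples = [
    Sample("mug_01", "mug", "pour", 1),
    Sample("mug_02", "mug", "handover", 1),
    Sample("ladle_01", "ladle", "scoop", 1),
    Sample("ladle_01", "ladle", "pour", 0),
    Sample("spatula_01", "spatula", "flip", 1),
]

# Novel object class: no ladle appears during training.
train_cls, test_cls = held_out_split(samples, "object_class", ["ladle"])

# Novel task: no "pour" example appears during training.
train_task, test_task = held_out_split(samples, "task", ["pour"])

print(len(train_cls), len(test_cls), len(train_task), len(test_task))
```

Passing `"instance"` as the field gives the easiest setting of the three, since other instances of the same class can still appear in training.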

Critical Analysis

The paper presents a compelling approach to robot grasping that leverages the power of large language models in a novel way. The authors demonstrate the effectiveness of FoundationGrasp on a range of tasks, showing that it can outperform prior specialized grasping systems.

One potential limitation of the approach is that it relies on the availability of a language description of the task or goal. In real-world scenarios, users may not always be able to provide such a description, or the description may be ambiguous or incomplete. The authors acknowledge this challenge and suggest that future work could explore ways to infer the task intent from other modalities, such as visual cues or user demonstrations.

Additionally, while the paper focuses on grasping tasks, the underlying idea of leveraging foundation models for more generalizable and task-oriented robot control could potentially be applied to a wider range of robotic manipulation and navigation tasks. Exploring these broader applications could be an interesting direction for future research.

Overall, the FoundationGrasp approach represents an exciting step forward in the field of robot grasping, and the authors' use of foundation models to enable more versatile and intelligent robot control is a promising direction for the field.


Conclusion

This paper introduces FoundationGrasp, a novel approach to robot grasping that leverages foundation models, such as large language models, to enable more generalizable, task-oriented grasping. By conditioning the grasping policy on a language description of the task or goal, FoundationGrasp can grasp objects in a way that is tailored to the user's intent, rather than just maximizing grip strength.

The authors demonstrate the effectiveness of FoundationGrasp on a range of grasping tasks, showing that it can outperform prior specialized grasping systems. This work represents an exciting step forward in the field of robot grasping, and the broader idea of leveraging foundation models for more versatile and intelligent robot control could have significant implications for a wide range of robotic applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers


What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

Dingzhe Li, Yixiang Jin, Yong A, Hongze Yu, Jun Shi, Xiaoshuai Hao, Peng Hao, Huaping Liu, Fuchun Sun, Bin Fang





The realization of universal robots is an ultimate goal of researchers. However, a key hurdle in achieving this goal lies in the robots' ability to manipulate objects in their unstructured surrounding environments according to different tasks. The learning-based approach is considered an effective way to address generalization. The impressive performance of foundation models in the fields of computer vision and natural language suggests the potential of embedding foundation models into manipulation tasks as a viable path toward achieving general manipulation capability. However, we believe achieving general manipulation capability requires an overarching framework akin to autonomous driving. This framework should encompass multiple functional modules, with different foundation models assuming distinct roles in facilitating general manipulation capability. This survey focuses on the contributions of foundation models to robot learning for manipulation. We propose a comprehensive framework and detail how foundation models can address challenges in each module of the framework. What's more, we examine current approaches, outline challenges, suggest future research directions, and identify potential risks associated with integrating foundation models into this domain.




Cross-Category Functional Grasp Tansfer

Rina Wu, Tianqiang Zhu, Xiangbo Lin, Yi Sun





Generating grasps for a dexterous hand often requires numerous grasping annotations. However, annotating high DoF dexterous hand poses is quite challenging. Especially for functional grasps, the grasp pose must be convenient for subsequent manipulation tasks. This prompts us to explore how people achieve manipulations on new objects based on past grasp experiences. We find that when grasping new items, people are adept at discovering and leveraging various similarities between objects, including shape, layout, and grasp type. Considering this, we analyze and collect grasp-related similarity relationships among 51 common tool-like object categories and annotate semantic grasp representation for 1768 objects. These objects are connected through similarities to form a knowledge graph, which helps infer our proposed cross-category functional grasp synthesis. Through extensive experiments, we demonstrate that the grasp-related knowledge indeed contributed to achieving functional grasp transfer across unknown or entirely new categories of objects. We will publicly release the dataset and code to facilitate future research.



Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge


Haoxiang Ma, Modi Shi, Boyang Gao, Di Huang





We focus on the generalization ability of the 6-DoF grasp detection method in this paper. While learning-based grasp detection methods can predict grasp poses for unseen objects using the grasp distribution learned from the training set, they often exhibit a significant performance drop when encountering objects with diverse shapes and structures. To enhance the grasp detection methods' generalization ability, we incorporate domain prior knowledge of robotic grasping, enabling better adaptation to objects with significant shape and structure differences. More specifically, we employ the physical constraint regularization during the training phase to guide the model towards predicting grasps that comply with the physical rule on grasping. For the unstable grasp poses predicted on novel objects, we design a contact-score joint optimization using the projection contact map to refine these poses in cluttered scenarios. Extensive experiments conducted on the GraspNet-1billion benchmark demonstrate a substantial performance gain on the novel object set and the real-world grasping experiments also demonstrate the effectiveness of our generalizing 6-DoF grasp detection method.



Towards Open-World Grasping with Large Vision-Language Models


Georgios Tziafas, Hamidreza Kasaei





The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.

