FastSAM

Maintainer: An-619

Total Score: 46

Last updated: 9/6/2024

  • Run this model: Run on HuggingFace
  • API spec: View on HuggingFace
  • Github link: No Github link provided
  • Paper link: No paper link provided

Model overview

FastSAM is a CNN-based Segment Anything Model trained on only 2% of the SA-1B dataset published by the Segment Anything Model (SAM) authors. Despite the much smaller training set, FastSAM achieves performance comparable to the full SAM model while running 50 times faster. FastSAM was developed by An-619 at CASIA-IVA-Lab.

The Segment Anything Model (SAM) is a state-of-the-art model that can generate high-quality object masks from various input prompts such as points or bounding boxes. It was trained on a massive dataset of 11 million images and 1.1 billion masks. The SAM-ViT-Base variant uses a ViT-B Vision Transformer backbone, while the SAM-ViT-Huge variant uses the larger ViT-H backbone.

Model inputs and outputs

Inputs

  • Image: The input image for which segmentation masks will be generated.
  • Text prompt: An optional text description of the object to be segmented.
  • Box prompt: An optional bounding box around the object to be segmented.
  • Point prompt: An optional set of points indicating the object to be segmented.

Outputs

  • Segmentation masks: One or more segmentation masks corresponding to the objects in the input image, based on the provided prompts.
  • Confidence scores: Confidence scores for each of the output segmentation masks.
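
As an illustration of how these inputs and outputs map to code, here is a minimal sketch based on the usage shown in the FastSAM codebase. The checkpoint and image paths are placeholders, and exact argument names (for example bbox vs. bboxes) may vary between releases.

```python
from fastsam import FastSAM, FastSAMPrompt

model = FastSAM("weights/FastSAM.pt")        # placeholder checkpoint path
image_path = "images/dogs.jpg"               # placeholder input image
device = "cpu"                               # or "cuda" if a GPU is available

# One forward pass segments everything; prompts then select from these results.
everything_results = model(image_path, device=device, retina_masks=True,
                           imgsz=1024, conf=0.4, iou=0.9)
prompt_process = FastSAMPrompt(image_path, everything_results, device=device)

# Point prompt: foreground point(s) on the object of interest (label 1 = foreground)
masks = prompt_process.point_prompt(points=[[620, 360]], pointlabel=[1])

# Box prompt: [x1, y1, x2, y2] around the object
masks = prompt_process.box_prompt(bbox=[200, 200, 500, 500])

# Text prompt: a free-form description of the object
masks = prompt_process.text_prompt(text="a photo of a dog")

# Save a visualization of the last set of masks
prompt_process.plot(annotations=masks, output_path="output/dogs.jpg")
```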

Capabilities

FastSAM can generate high-quality object segmentation masks at a much faster speed than the original SAM model. This makes it particularly useful for real-time applications or when computational resources are limited. The model has shown strong zero-shot performance on a variety of segmentation tasks, similar to the full SAM model.

What can I use it for?

FastSAM can be used in a wide range of computer vision applications that require object segmentation, such as:

  • Image editing: Quickly select and mask objects in an image for editing, compositing, or other manipulations.
  • Autonomous systems: Extract detailed object information from camera inputs for tasks like self-driving cars, robots, or drones.
  • Content creation: Easily isolate and extract objects from images for use in digital art, 3D modeling, or other creative projects.

Things to try

Try experimenting with different input prompts - text, bounding boxes, or point clicks - to see how the model's segmentation results vary. You can also compare the speed and performance of FastSAM to the original SAM model on your specific use case. Additionally, explore the different inference options provided by the FastSAM codebase.
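
To compare speed on your own images, a rough timing sketch like the following is one way to start. It assumes the fastsam and segment_anything packages are installed and their checkpoints downloaded; the paths are placeholders, and wall-clock numbers will depend heavily on hardware.

```python
import time

import cv2
from fastsam import FastSAM
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

image_path = "images/dogs.jpg"                                   # placeholder image
image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)  # RGB array for SAM

def timed(label, fn):
    """Run fn once and print how long it took."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# FastSAM: segment everything in a single pass
fast_model = FastSAM("weights/FastSAM.pt")
timed("FastSAM", lambda: fast_model(image_path, retina_masks=True,
                                    imgsz=1024, conf=0.4, iou=0.9))

# Original SAM (ViT-H): automatic mask generation over a grid of point prompts
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
timed("SAM ViT-H", lambda: SamAutomaticMaskGenerator(sam).generate(image))
```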



This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Models

segment-anything

Maintainer: ybelkada

Total Score: 86

The segment-anything model, developed by researchers at Meta AI Research, is a powerful image segmentation model that can generate high-quality object masks from various input prompts such as points or bounding boxes. Trained on a large dataset of 11 million images and 1.1 billion masks, the model has strong zero-shot performance on a variety of segmentation tasks. The ViT-Huge version of the Segment Anything Model (SAM) is a particularly capable variant.

The model consists of three main components: a ViT-based image encoder that computes image embeddings, a prompt encoder that generates embeddings for points and bounding boxes, and a mask decoder that performs cross-attention between the image and prompt embeddings to output the final segmentation masks. This architecture allows the model to transfer zero-shot to new image distributions and tasks, often matching or exceeding the performance of prior fully supervised methods.

Model inputs and outputs

Inputs

  • Image: The input image for which segmentation masks should be generated.
  • Prompts: The model can take various types of prompts as input, including:
    • Points: 2D locations on the image indicating the approximate position of the object of interest.
    • Bounding boxes: The coordinates of a bounding box around the object of interest.
    • Segmentation masks: An existing segmentation mask that can be refined by the model.

Outputs

  • Segmentation masks: The model outputs high-quality segmentation masks for the objects in the input image, guided by the provided prompts.
  • Scores: The model also returns confidence scores for each predicted mask, indicating the estimated quality of the segmentation.

Capabilities

The segment-anything model excels at generating detailed and accurate segmentation masks for a wide variety of objects in an image, even in challenging scenarios with occlusions or complex backgrounds. Unlike many previous segmentation models, it can transfer zero-shot to new image distributions and tasks, often outperforming prior fully supervised approaches. For example, the model can be used to segment small objects like windows in a car, larger objects like people or animals, or even entire scenes with multiple overlapping elements. The ability to provide prompts like points or bounding boxes makes the model highly flexible and adaptable to different use cases.

What can I use it for?

The segment-anything model has a wide range of potential applications, including:

  • Object detection and segmentation: Identify and delineate specific objects in images for applications like autonomous driving, image understanding, and augmented reality.
  • Instance segmentation: Separate individual objects within a scene, which can be useful for tasks like inventory management, robotics, and image editing.
  • Annotation and labeling: Quickly generate high-quality segmentation masks to annotate and label image datasets, accelerating the development of computer vision systems.
  • Content-aware image editing: Leverage the model's ability to segment objects to enable advanced editing capabilities, such as selective masking, object removal, and image compositing.

Things to try

One interesting aspect of the segment-anything model is its ability to adapt to new tasks and distributions through the use of prompts. Try experimenting with different types of prompts, such as using bounding boxes instead of points, or providing an initial segmentation mask as input to refine. You can also explore the model's performance on a variety of image types, from natural scenes to synthetic or artistic images, to understand its versatility and limitations. Additionally, the ViT-Huge version of the Segment Anything Model may offer increased segmentation accuracy and detail compared to the base model, so it's worth trying out as well.
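
The prompt types listed above can be exercised with the segment-anything Python package; a minimal sketch follows, in which the checkpoint path, image, and coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H variant; the checkpoint path is a placeholder.
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("images/truck.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                        # compute the image embedding once

# Point prompt: a single foreground click (label 1) on the object of interest
masks, scores, low_res_logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,                        # return three candidate masks + scores
)

# Refine: feed the best low-resolution mask back in together with a box prompt
best = int(scores.argmax())
masks, scores, _ = predictor.predict(
    box=np.array([425, 300, 700, 875]),           # [x1, y1, x2, y2]
    mask_input=low_res_logits[best:best + 1],
    multimask_output=False,
)
```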

sam-vit-base

Maintainer: facebook

Total Score: 85

The sam-vit-base model is a Segment Anything Model (SAM) developed by researchers at Facebook. SAM is a powerful image segmentation model that can generate high-quality object masks from input prompts such as points or bounding boxes. It has been trained on a dataset of 11 million images and 1.1 billion masks, giving it impressive zero-shot performance on a variety of segmentation tasks.

SAM is made up of three main modules: a VisionEncoder that encodes the input image using a Vision Transformer (ViT) architecture, a PromptEncoder that generates embeddings for the input prompts, and a MaskDecoder that produces the output segmentation masks. The model can be used to generate masks for all objects in an image, or for specific objects based on provided prompts. Similar models include the sam-vit-huge model, which uses a larger ViT-H backbone, and the segment-anything model, which provides additional tooling and support.

Model inputs and outputs

Inputs

  • Image: The input image for which segmentation masks should be generated.
  • Input prompts: Points, bounding boxes, or other prompts that indicate the regions of interest in the image.

Outputs

  • Segmentation masks: One or more binary masks indicating the regions in the image corresponding to the input prompts.
  • Mask scores: Scores indicating the confidence of the model in each predicted mask.

Capabilities

The sam-vit-base model is capable of generating high-quality segmentation masks for a wide variety of objects in an image, even in complex scenes. It can handle multiple prompts simultaneously, allowing users to segment multiple objects of interest with a single inference. The model's zero-shot capabilities also enable it to perform well on new domains and tasks without additional fine-tuning.

What can I use it for?

The sam-vit-base model can be a powerful tool for a variety of computer vision applications, such as:

  • Content moderation: Use the model to automatically detect and mask inappropriate or explicit content in images.
  • Image editing: Leverage the model's precise segmentation to enable advanced image editing capabilities, such as object removal, background replacement, or composite image creation.
  • Robotic perception: Integrate the model into robotic systems to enable fine-grained object understanding and manipulation.
  • Medical imaging: Apply the model to medical imaging tasks like organ segmentation or tumor detection.

The segment-anything model provides additional tools and support for working with SAM, including pre-built pipelines and ONNX export capabilities.

Things to try

One interesting aspect of the sam-vit-base model is its ability to perform zero-shot segmentation, where it can generate masks for objects without any prior training on those specific classes. Try experimenting with a variety of input prompts and images to see how the model performs on different types of objects and scenes. Additionally, you can compare the performance of the sam-vit-base model to the larger sam-vit-huge version to understand the tradeoffs between model size and accuracy.
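
A minimal prompted-inference sketch using the Hugging Face transformers API for this checkpoint; the image URL and point coordinates are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import SamModel, SamProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamModel.from_pretrained("facebook/sam-vit-base").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-base")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # placeholder image
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]]                                     # one 2D point prompt

inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Rescale the predicted masks back to the original image resolution
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
scores = outputs.iou_scores          # model confidence for each predicted mask
```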

sam-vit-huge

Maintainer: facebook

Total Score: 101

The sam-vit-huge model is a powerful AI system developed by Facebook researchers that can generate high-quality object masks from input prompts such as points or boxes. It is part of the Segment Anything project, which aims to build the largest segmentation dataset to date, with over 1 billion masks on 11 million images. The model is based on a Vision Transformer (ViT) architecture and has been trained on a vast dataset, giving it impressive zero-shot performance on a variety of segmentation tasks. Similar models like the CLIP ViT model and Anything Preservation also use transformer-based architectures for image tasks, but the sam-vit-huge model is specifically designed for high-quality object segmentation.

Model inputs and outputs

The sam-vit-huge model takes input prompts, such as points or bounding boxes, and generates pixel-level masks for the objects in the image. This allows users to quickly and accurately segment objects of interest without the need for laborious manual annotation.

Inputs

  • Prompts: Points or bounding boxes that indicate the objects of interest in the image.

Outputs

  • Object masks: Pixel-level segmentation masks for the objects in the image, based on the input prompts.

Capabilities

The sam-vit-huge model excels at generating high-quality, detailed object masks. It can accurately segment a wide variety of objects, even in complex scenes with multiple overlapping elements. For example, the model can segment individual cans in an image of a group of bean cans, or identify distinct animals in a forest scene.

What can I use it for?

The sam-vit-huge model can be a valuable tool for a variety of applications that require accurate object segmentation, such as:

  • Image editing and manipulation: Isolating objects in an image for selective editing, compositing, or processing.
  • Robotics and autonomous systems: Enabling robots to perceive and interact with specific objects in their environments.
  • Medical imaging: Segmenting anatomical structures in medical scans for analysis and diagnosis.
  • Satellite and aerial imagery analysis: Identifying and extracting features of interest from remote sensing data.

By leveraging the model's impressive zero-shot capabilities, users can quickly adapt it to new domains and tasks without the need for extensive fine-tuning or retraining.

Things to try

One key insight about the sam-vit-huge model is its ability to generalize to a wide range of segmentation tasks, thanks to its training on a vast and diverse dataset. This suggests that the model could be a powerful tool for exploring novel applications beyond the traditional use cases for object segmentation. For example, you could experiment with using the model to segment unusual or unconventional objects, such as abstract shapes, text, or even emojis, to see how it performs and identify any interesting capabilities or limitations.
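
To generate masks for everything in an image with this larger checkpoint, one option is the transformers mask-generation pipeline; a short sketch follows, in which the image URL is a placeholder.

```python
from transformers import pipeline

# "mask-generation" runs SAM in automatic mode, prompting it with a grid of points.
generator = pipeline("mask-generation", model="facebook/sam-vit-huge", device=0)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
outputs = generator(image_url, points_per_batch=64)

print(len(outputs["masks"]))     # one binary mask per detected region
print(outputs["scores"][:5])     # corresponding confidence scores
```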

sam-2

Maintainer: meta

Total Score: 5

SAM 2: Segment Anything in Images and Videos is a foundation model for promptable visual segmentation in images and videos. It extends the original Segment Anything Model (SAM) by Meta to support video processing. The model design is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 is trained on the Segment Anything Video (SA-V) dataset, the largest video segmentation dataset to date, providing strong performance across a wide range of tasks and visual domains.

Model inputs and outputs

The SAM 2 model takes an image or video as input and allows users to provide prompts (such as points, boxes, or text) to segment relevant objects. The outputs include a combined mask covering all segmented objects as well as individual masks for each object.

Inputs

  • Image: The input image to perform segmentation on.
  • Use M2M: A boolean flag to use the model-in-the-loop data engine, which improves the model and data via user interaction.
  • Points per side: The number of points per side for mask generation.
  • Pred IoU thresh: The predicted IoU threshold for mask prediction.
  • Stability score thresh: The stability score threshold for mask prediction.

Outputs

  • Combined mask: A single combined mask covering all segmented objects.
  • Individual masks: An array of individual masks for each segmented object.

Capabilities

SAM 2 can be used for a variety of visual segmentation tasks, including interactive segmentation, automatic mask generation, and video segmentation and tracking. It builds upon the strong performance of the original SAM model, while adding the capability to process video data.

What can I use it for?

SAM 2 can be used for a wide range of applications that require precise object segmentation, such as content creation, video editing, autonomous driving, and robotic manipulation. The video processing capabilities make it particularly useful for applications that involve dynamic scenes, such as surveillance, sports analysis, and live event coverage.

Things to try

With SAM 2, you can experiment with different types of prompts (points, boxes, or text) to see how they affect the segmentation results. You can also try the automatic mask generation feature to quickly isolate objects of interest without manual input. Additionally, the video processing capabilities allow you to track objects across multiple frames, which could be useful for applications like motion analysis or object tracking.
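
For still images, the sam2 package exposes an image predictor with a SAM-style prompting API; here is a minimal sketch based on the SAM 2 repository, where the checkpoint and config names are placeholders that depend on the release you download.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "checkpoints/sam2_hiera_large.pt"   # placeholder checkpoint path
model_cfg = "sam2_hiera_l.yaml"                  # config matching the checkpoint

predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
image = np.array(Image.open("images/frame_000.jpg").convert("RGB"))  # placeholder frame

with torch.inference_mode():
    predictor.set_image(image)
    # A single foreground point prompt; box prompts go through the `box` argument.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )
```

The same package also ships a video predictor for prompting an object once and then tracking it across frames.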
