Segment Anything in 3D with Radiance Fields

Read original: arXiv:2304.12308 - Published 4/17/2024 by Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian

Overview

The Segment Anything Model (SAM) is a powerful 2D segmentation model that can generate high-quality results.
This paper aims to generalize SAM to segment 3D objects.
The proposed solution, referred to as SA3D, leverages the radiance field as a cheap and off-the-shelf prior to connect multi-view 2D images to the 3D space.
SA3D allows users to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate a corresponding 2D mask with SAM.
The system then alternately performs mask inverse rendering and cross-view self-prompting to iteratively refine the 3D mask of the target object.

Plain English Explanation

The researchers wanted to take a powerful 2D image segmentation model called the Segment Anything Model (SAM) and extend it to work with 3D objects. Instead of having to go through the expensive process of collecting and annotating 3D data, they came up with a clever solution.

Their approach, called SA3D, uses something called a "radiance field" to connect the 2D images of an object to its 3D shape. The user only needs to provide a rough "prompt" (for example, a few points clicked on the object) in a single 2D image, and the system then uses that information to automatically generate a 3D segmentation of the object.

The key idea is that the system first uses SAM to get a 2D mask of the object from the user's prompt. It then uses the radiance field to project that 2D mask into the 3D space, refining the 3D mask through an iterative process. The radiance field helps the system understand how the 2D image relates to the 3D shape, acting as a kind of "bridge" between the two.

This approach allows the researchers to leverage the power of SAM for 2D segmentation, while also extending its capabilities to work with 3D objects without needing to collect expensive 3D data. The end result is a system that can segment 3D objects quickly and efficiently, just by using a simple 2D prompt from the user.

Technical Explanation

The researchers designed an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. They refer to the proposed solution as SA3D, short for Segment Anything in 3D.
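To make the "prior" concrete, here is a minimal sketch of how a radiance field ties one pixel to 3D samples along its camera ray. It uses the standard volume rendering weights from the NeRF literature; the function names `render_weights` and `render_mask_pixel` are hypothetical, and this is not taken from the SA3D code.

```python
import numpy as np

def render_weights(sigma, delta):
    """Volume rendering weights for the samples along one ray.

    sigma: (N,) non-negative densities predicted by the radiance field
    delta: (N,) spacing between consecutive samples
    """
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each sample
    # Transmittance: fraction of the ray that reaches sample i unblocked.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    return trans * alpha

def render_mask_pixel(sigma, delta, mask_values):
    """Render a per-sample 3D mask value into a single 2D mask pixel."""
    return float(np.sum(render_weights(sigma, delta) * mask_values))

# A ray that passes through empty space, then hits a dense surface.
sigma = np.array([0.0, 0.0, 50.0, 50.0, 0.0])
delta = np.full(5, 0.1)
weights = render_weights(sigma, delta)  # nearly all weight lands on sample 2
```

Because the weights concentrate on the first dense samples, a 2D mask value for this pixel can be attributed mostly to the visible surface rather than to empty space or occluded regions.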

With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively refine the 3D mask of the target object.

For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space, guided by the density distribution learned by the radiance field, to refine the 3D mask. Then, for a new view, cross-view self-prompting renders the current (still inaccurate) 3D mask into a 2D mask and automatically extracts reliable prompts from it as the input to SAM.
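A rough sketch of both steps follows. Everything in it is an assumption made for illustration rather than the authors' implementation: the projection loss (reward rendered mask inside the SAM mask, penalize it outside with a weight `lam`), the default values, and the hard neighborhood suppression used when picking point prompts. The per-sample `weights` are assumed to be volume rendering weights already computed from the radiance field's densities.

```python
import numpy as np

def inverse_render_step(voxel_mask, sample_idx, weights, sam_mask,
                        lam=0.15, lr=1.0):
    """One gradient step that pushes a 2D SAM mask into a 3D voxel mask.

    voxel_mask: (V,) current 3D mask confidence per voxel
    sample_idx: (R, N) voxel index hit by each of N samples on R rays
    weights:    (R, N) volume rendering weight of each sample
    sam_mask:   (R,) 0/1 SAM label of the pixel behind each ray
    lam, lr:    hypothetical negative-term weight and step size
    """
    # Assumed loss: -sum_r sam*M(r) + lam * sum_r (1 - sam)*M(r),
    # where M(r) = sum_i weights[r, i] * voxel_mask[sample_idx[r, i]].
    per_ray = -sam_mask + lam * (1.0 - sam_mask)
    grad = np.zeros_like(voxel_mask, dtype=float)
    # Unbuffered accumulation: several samples may hit the same voxel.
    np.add.at(grad, sample_idx, weights * per_ray[:, None])
    return voxel_mask - lr * grad

def select_point_prompts(rendered_mask, num_points=3, radius=2, min_conf=0.0):
    """Pick confident, spread-out pixels of a rendered mask as (x, y) prompts."""
    conf = rendered_mask.astype(float).copy()
    h, w = conf.shape
    points = []
    for _ in range(num_points):
        y, x = np.unravel_index(np.argmax(conf), conf.shape)
        if conf[y, x] <= min_conf:
            break  # nothing confident is left
        points.append((int(x), int(y)))
        # Suppress the neighborhood so the next point lands elsewhere.
        conf[max(0, y - radius):min(h, y + radius + 1),
             max(0, x - radius):min(w, x + radius + 1)] = -np.inf
    return points
```

Samples with large rendering weights sit on the visible surface, so they receive most of the update; this is how the learned density guides where the 2D mask lands in 3D. The selected points would then be passed to SAM as point prompts for the new view.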

The researchers show in their experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Their research reveals a potential methodology to lift the ability of a 2D segmentation model, such as SAM, to 3D.

Critical Analysis

The paper presents an interesting approach to extending a powerful 2D segmentation model, SAM, to work with 3D objects. The use of the radiance field as a cheap and efficient prior to connect 2D images to 3D space is a clever solution that avoids the need for expensive 3D data collection and annotation.

However, the abstract does not address potential limitations or caveats of the approach. For example, it's unclear how well SA3D would perform on complex 3D scenes with occlusions or multiple objects. Additionally, although the authors report 3D segmentation within seconds, it is not clear from the abstract how the cost of the iterative refinement grows with the number of views, which could be a concern for real-time applications.

Further research could explore ways to optimize the performance of SA3D, potentially by incorporating additional priors or drawing on related work that applies SAM to 3D medical images. It would also be interesting to see how SA3D compares to other approaches for 3D segmentation, both in terms of accuracy and efficiency.

Conclusion

The Segment Anything in 3D (SA3D) method presented in this paper represents a promising step towards generalizing powerful 2D segmentation models to work with 3D objects. By leveraging the radiance field as a cheap and off-the-shelf prior, SA3D allows users to generate 3D segmentations from just a simple 2D prompt, without the need for expensive 3D data collection and annotation.

This research reveals a potential methodology to extend the capabilities of 2D segmentation models, such as SAM, to the 3D domain. The authors report that the iterative refinement process of mask inverse rendering and cross-view self-prompting adapts to various scenes, and the reported segmentation time of seconds suggests potential for practical applications.

As the field of computer vision continues to advance, approaches like SA3D that can bridge the gap between 2D and 3D segmentation will likely become increasingly important, enabling more robust and versatile visual understanding systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Segment Anything in 3D with Radiance Fields

Jiazhong Cen, Jiemin Fang, Zanwei Zhou, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, Qi Tian

The Segment Anything Model (SAM) emerges as a powerful vision foundation model to generate high-quality 2D segmentation results. This paper aims to generalize SAM to segment 3D objects. Rather than replicating the data acquisition and annotation procedure which is costly in 3D, we design an efficient solution, leveraging the radiance field as a cheap and off-the-shelf prior that connects multi-view 2D images to the 3D space. We refer to the proposed solution as SA3D, short for Segment Anything in 3D. With SA3D, the user is only required to provide a 2D segmentation prompt (e.g., rough points) for the target object in a single view, which is used to generate its corresponding 2D mask with SAM. Next, SA3D alternately performs mask inverse rendering and cross-view self-prompting across various views to iteratively refine the 3D mask of the target object. For one view, mask inverse rendering projects the 2D mask obtained by SAM into the 3D space with guidance of the density distribution learned by the radiance field for 3D mask refinement; Then, cross-view self-prompting extracts reliable prompts automatically as the input to SAM from the rendered 2D mask of the inaccurate 3D mask for a new view. We show in experiments that SA3D adapts to various scenes and achieves 3D segmentation within seconds. Our research reveals a potential methodology to lift the ability of a 2D segmentation model to 3D. Our code is available at https://github.com/Jumpat/SegmentAnythingin3D.

4/17/2024

Segment anything model 2: an application to 2D and 3D medical images

Haoyu Dong, Hanxue Gu, Yaqian Chen, Jichen Yang, Yuwen Chen, Maciej A. Mazurowski

Segment Anything Model (SAM) has gained significant attention because of its ability to segment various objects in images given a prompt. The recently developed SAM 2 has extended this ability to video inputs. This opens an opportunity to apply SAM to 3D images, one of the fundamental tasks in the medical imaging field. In this paper, we extensively evaluate SAM 2's ability to segment both 2D and 3D medical images by first collecting 21 medical imaging datasets, including surgical videos, common 3D modalities such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET) as well as 2D modalities such as X-ray and ultrasound. Two evaluation settings of SAM 2 are considered: (1) multi-frame 3D segmentation, where prompts are provided to one or multiple slice(s) selected from the volume, and (2) single-frame 2D segmentation, where prompts are provided to each slice. The former only applies to videos and 3D modalities, while the latter applies to all datasets. Our results show that SAM 2 exhibits similar performance as SAM under single-frame 2D segmentation, and has variable performance under multi-frame 3D segmentation depending on the choices of slices to annotate, the direction of the propagation, the predictions utilized during the propagation, etc. We believe our work enhances the understanding of SAM 2's behavior in the medical field and provides directions for future work in adapting SAM 2 to this domain. Our code is available at: https://github.com/mazurowski-lab/segment-anything2-medical-evaluation.

8/23/2024

SAM3D: Zero-Shot Semi-Automatic Segmentation in 3D Medical Images with the Segment Anything Model

Trevor J. Chan, Aarush Sahni, Yijin Fang, Jie Li, Alisha Luthra, Alison Pouch, Chamith S. Rajapakse

We introduce SAM3D, a new approach to semi-automatic zero-shot segmentation of 3D images building on the existing Segment Anything Model. We achieve fast and accurate segmentations in 3D images with a four-step strategy involving: user prompting with 3D polylines, volume slicing along multiple axes, slice-wide inference with a pretrained model, and recomposition and refinement in 3D. We evaluated SAM3D performance qualitatively on an array of imaging modalities and anatomical structures and quantify performance for specific structures in abdominal pelvic CT and brain MRI. Notably, our method achieves good performance with zero model training or finetuning, making it particularly useful for tasks with a scarcity of preexisting labeled data. By enabling users to create 3D segmentations of unseen data quickly and with dramatically reduced manual input, these methods have the potential to aid surgical planning and education, diagnostic imaging, and scientific research.

8/9/2024

EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu

Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) has revolutionized the field of 2D computer vision with superior performance, which makes the use of VFM to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow that cannot be applied in practical embodied tasks. In this paper, we aim to leverage Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM by 3D-aware queries, which is then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefit from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operation, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transferring experiments and show great potential in open-vocabulary and data-efficient setting. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation.

8/22/2024