Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

2312.07530

Published 4/24/2024 by Kuan-Chih Huang, Yi-Hsuan Tsai, Ming-Hsuan Yang

Abstract

Weakly supervised 3D object detection aims to learn a 3D detector with lower annotation cost, e.g., 2D labels. Unlike prior work which still relies on few accurate 3D annotations, we propose a framework to study how to leverage constraints between 2D and 3D domains without requiring any 3D labels. Specifically, we employ visual data from three perspectives to establish connections between 2D and 3D domains. First, we design a feature-level constraint to align LiDAR and image features based on object-aware regions. Second, the output-level constraint is developed to enforce the overlap between 2D and projected 3D box estimations. Finally, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data. We conduct extensive experiments on the KITTI dataset to validate the effectiveness of the proposed three constraints. Without using any 3D labels, our method achieves favorable performance against state-of-the-art approaches and is competitive with the method that uses 500-frame 3D annotations. Code and models will be made publicly available at https://github.com/kuanchihhuang/VG-W3D.

Overview

This paper proposes a novel weakly supervised approach for 3D object detection that leverages multi-level visual guidance.
The method aims to learn 3D object detection without requiring any 3D labels, which can be costly and time-consuming to obtain.
Instead, the approach uses weaker supervision from the image domain, such as 2D labels, and exploits constraints between the 2D and 3D domains to guide the 3D object detection model.

Plain English Explanation

The paper presents a new way to train 3D object detectors without any 3D bounding box labels. This is important because collecting those 3D labels can be very difficult and expensive.

Instead, the method relies on cheaper 2D labels on the camera images and uses them to guide the 3D detector in three ways: by aligning what the detector sees in the LiDAR data with what appears in the image, by checking that each predicted 3D box lines up with the 2D box when projected into the image, and by generating 3D pseudo-labels that agree with the images and training on them. This multi-level visual guidance helps the 3D detector learn effectively without any 3D annotations.

By avoiding the need for expensive 3D labels, this approach could make 3D object detection more accessible and practical for real-world applications. Other work on weakly supervised 3D scene understanding, such as the scene graph generation and visual grounding papers listed under Related Papers, shares this focus.

Technical Explanation

The proposed framework consists of a 3D object detection backbone network and a multi-level visual guidance module. The backbone takes in a 3D point cloud and predicts 3D bounding boxes for objects.

The visual guidance module uses visual data to connect the 2D and 3D domains at three levels, without any 3D labels:

Feature-level Constraint: LiDAR and image features are aligned based on object-aware regions.
Output-level Constraint: The predicted 3D boxes, once projected into the image, are encouraged to overlap with the 2D boxes.
Training-level Constraint: Accurate and consistent 3D pseudo-labels that align with the visual data are produced and used for training.
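
The output-level idea from the abstract (enforcing overlap between 2D boxes and projected 3D box estimations) can be illustrated with a small geometric sketch. This is not the paper's implementation: the function names, the KITTI-like camera convention (x right, y down, z forward, yaw about the y axis), the 3x4 projection matrix P, and the use of 1 - IoU as the penalty are all assumptions made for illustration. A real training loss would also need to be written with differentiable tensor operations, which plain Python is not.

```python
import math
from itertools import product


def box3d_corners(center, size, yaw):
    """Return the 8 corners of a 3D box in camera coordinates.

    Assumed convention: center is the geometric center (x, y, z),
    size is (length, height, width), yaw rotates about the y axis.
    """
    cx, cy, cz = center
    length, height, width = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        dx, dy, dz = sx * length, sy * height, sz * width
        # Rotate the offset about the y axis, then translate to the center.
        corners.append((cx + c * dx + s * dz, cy + dy, cz - s * dx + c * dz))
    return corners


def project_box(corners, P):
    """Project corners with a 3x4 matrix P; return the enclosing 2D box."""
    us, vs = [], []
    for x, y, z in corners:
        u, v, d = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in P)
        if d <= 0:
            # A corner lies behind the camera: treat the box as unprojectable.
            return None
        us.append(u / d)
        vs.append(v / d)
    return (min(us), min(vs), max(us), max(vs))


def iou_2d(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def overlap_penalty(center, size, yaw, box2d, P):
    """1 - IoU between the projected 3D box and a 2D box (hypothetical loss)."""
    projected = project_box(box3d_corners(center, size, yaw), P)
    if projected is None:
        return 1.0
    return 1.0 - iou_2d(projected, box2d)
```

The penalty is 0 when the projected 3D box coincides with the 2D box and 1 when they do not overlap at all, so minimizing it pushes the 3D estimate toward agreement with the 2D label.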

This multi-level supervision helps the 3D object detector learn richer visual representations and better localize objects in 3D space, even without any 3D annotations. Related work explores semi-supervised and multi-view approaches to 3D object detection.
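
The abstract's training-level constraint relies on 3D pseudo-labels that align with the visual data. One simple way to picture such a check is to keep a candidate 3D box only if it is confident enough and its image projection overlaps a 2D box. The sketch below is a hypothetical filter, not the procedure used in the paper: the dictionary keys, the thresholds, and the one-candidate-per-2D-box matching rule are all assumptions.

```python
def iou_2d(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_pseudo_labels(candidates, boxes_2d, min_score=0.5, min_iou=0.5):
    """Keep at most one candidate per 2D box (assumed matching rule).

    Each candidate is a dict with hypothetical keys:
      "score"        - detector confidence in [0, 1]
      "projected_2d" - the candidate's 3D box projected to (x1, y1, x2, y2)
    """
    kept = []
    used = set()
    for box in boxes_2d:
        best_idx, best_iou = None, min_iou
        for i, cand in enumerate(candidates):
            if i in used or cand["score"] < min_score:
                continue
            overlap = iou_2d(cand["projected_2d"], box)
            if overlap >= best_iou:
                best_idx, best_iou = i, overlap
        if best_idx is not None:
            used.add(best_idx)
            kept.append(candidates[best_idx])
    return kept
```

Candidates that are low-confidence, or that match no 2D box, are dropped rather than used as training targets.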

Critical Analysis

The paper demonstrates the effectiveness of the proposed weakly supervised 3D object detection approach on the KITTI dataset. However, it does not provide a thorough analysis of the method's limitations or failure cases.

For example, the performance of the approach may degrade when the 2D annotations are noisy or incomplete. Additionally, because the method trains on 3D pseudo-labels, errors in those pseudo-labels could propagate into the final detector. Further research is needed to understand the robustness of this approach to real-world variations in the input data.

Related work on enhancing 3D object detection from sparse point clouds could be a valuable direction to consider in conjunction with the weakly supervised approach presented here.

Conclusion

This paper presents a novel weakly supervised 3D object detection framework that leverages multi-level visual guidance from 2D annotations, without requiring costly 3D bounding box labels. By avoiding the need for full 3D annotations, this approach has the potential to make 3D object detection more accessible and practical for real-world applications.

The key technical insight is to use weaker supervision from the 2D image domain to guide the 3D detector at multiple levels: aligning LiDAR and image features, enforcing overlap between 2D boxes and projected 3D boxes, and training on 3D pseudo-labels that align with the visual data. This multi-level guidance helps the model learn effective 3D representations, even in the absence of any 3D annotations.

While the paper demonstrates promising results, further research is needed to fully understand the limitations and robustness of this weakly supervised approach. Nonetheless, this work represents an important step towards more efficient and practical 3D object detection systems.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling

Xu Wang, Yifan Li, Qiudan Zhang, Wenhui Wu, Mark Junjie Li, Jianmin Jiang

Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes. Extensive experiments demonstrate that our 3D-VLAP achieves comparable results with current advanced fully supervised methods, meanwhile significantly alleviating the pressure of data annotation.

4/4/2024

cs.CV

Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang

Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.

4/16/2024

cs.CV cs.CL

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

Wufei Ma, Guanning Zeng, Guofeng Zhang, Qihao Liu, Letian Zhang, Adam Kortylewski, Yaoyao Liu, Alan Yuille

A vision model with general-purpose object-level 3D understanding should be capable of inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location and 3D viewpoint) for arbitrary rigid objects in natural images. This is a challenging task, as it involves inferring 3D information from 2D signals and most importantly, generalizing to rigid objects from unseen categories. However, existing datasets with object-level 3D annotations are often limited by the number of categories or the quality of annotations. Models developed on these datasets become specialists for certain categories or domains, and fail to generalize. In this work, we present ImageNet3D, a large dataset for general-purpose object-level 3D understanding. ImageNet3D augments 200 categories from the ImageNet dataset with 2D bounding box, 3D pose, 3D location annotations, and image captions interleaved with 3D information. With the new annotations available in ImageNet3D, we could (i) analyze the object-level 3D awareness of visual foundation models, (ii) study and develop general-purpose models that infer both 2D and 3D information for arbitrary rigid objects in natural images, and (iii) integrate unified 3D models with large language models for 3D-related reasoning. We consider two new tasks, probing of object-level 3D awareness and open vocabulary pose estimation, besides standard classification and pose estimation. Experimental results on ImageNet3D demonstrate the potential of our dataset in building vision models with stronger general-purpose object-level 3D understanding.

6/17/2024

cs.CV

Shelf-Supervised Multi-Modal Pre-Training for 3D Object Detection

Mehar Khurana, Neehar Peri, Deva Ramanan, James Hays

State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best-practices for self-supervised learning from the image domain to point clouds (such as contrastive learning). However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting their effectiveness. We do note, however, that such data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale image data. Specifically, we propose a shelf-supervised approach (e.g. supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and WOD, significantly improving over prior work in limited data settings.

6/17/2024

cs.CV cs.LG cs.RO