Salient Object Detection in RGB-D Videos

2310.15482

Published 5/22/2024 by Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

Abstract

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, conducting salient object detection (SOD) in RGB-D videos presents a highly promising and evolving avenue. Despite the potential of this area, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: the dataset and the model. On one front, we construct the RDVS dataset, a new RGB-D VSOD dataset with realistic depth and characterized by its diversity of scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Moreover, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. In pursuit of effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within RFM, we design a universal interaction module (UIM) and then integrate holistic multi-modal attentive paths (HMAPs) for refining multi-modal low-level features before reaching RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our RDVS, highlight the superiority of DCTNet+ over 17 VSOD models and 14 RGB-D SOD models. Ablation experiments were performed on both pseudo and realistic RGB-D video datasets to demonstrate the advantages of individual modules as well as the necessity of introducing realistic depth. Our code together with RDVS dataset will be available at https://github.com/kerenfu/RDVS/.

Overview

Researchers have developed a new dataset and model for salient object detection in RGB-D (color and depth) videos, which is a promising but under-explored area.
The RDVS dataset provides a diverse collection of realistic RGB-D videos with detailed annotations, addressing limitations of previous datasets.
The DCTNet+ model introduces novel modules to effectively integrate and refine RGB, depth, and optical flow features for accurate salient object detection.

Plain English Explanation

The paper focuses on a computer vision task called salient object detection (SOD) in RGB-D (color and depth) videos. As depth-sensing cameras have become more common, there is growing interest in using both color and depth information to identify the most important or "salient" objects in a scene.

The researchers first created a new dataset called RDVS, which contains a wide variety of realistic RGB-D videos with detailed annotations identifying the salient objects in each frame. This helps address issues with previous datasets that were less diverse or lacked thorough labeling.

The researchers also developed a new deep learning model called DCTNet+ that takes in the color, depth, and motion information from RGB-D videos and learns to accurately detect the salient objects. DCTNet+ uses novel modules to effectively combine and refine these different types of visual features, leading to more precise salient object predictions.

Overall, this work advances the state of the art in RGB-D video salient object detection by providing a high-quality dataset and a more effective model for this emerging computer vision application. The improved capabilities could benefit real-world tasks such as autonomous driving, video surveillance, and image/video editing.

Technical Explanation

The paper first introduces the RDVS dataset, a new RGB-D video salient object detection (RGB-D VSOD) dataset that addresses limitations of prior datasets. RDVS contains a diverse collection of RGB-D videos with realistic depth and rigorous frame-by-frame annotations of salient objects. The researchers validate RDVS through comprehensive attribute and object-oriented analyses, and provide training and testing splits.

The second contribution is the DCTNet+ model, a three-stream network designed for RGB-D VSOD. DCTNet+ emphasizes the RGB modality while treating depth and optical flow as auxiliary inputs. To enhance, refine, and fuse features for the final prediction, DCTNet+ introduces two modules:

Multi-Modal Attention Module (MAM): Applies attention mechanisms to selectively focus on and fuse important features from the RGB, depth, and flow streams.
Refinement Fusion Module (RFM): Refines and fuses the multi-modal features for the final prediction. A Universal Interaction Module (UIM) enhances interaction and fusion within the RFM, and Holistic Multi-Modal Attentive Paths (HMAPs) refine the multi-modal low-level features before they reach the RFMs.
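
The idea of an RGB-centric network with auxiliary depth and flow streams can be illustrated with a small sketch. This is not the authors' MAM or RFM: the function name `attend_rgb`, the sigmoid gating, and the residual addition are all assumptions chosen for illustration, and features are plain Python lists rather than real feature maps.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attend_rgb(rgb: List[float], depth: List[float], flow: List[float]) -> List[float]:
    """Hypothetical RGB-centric fusion (an illustration, not the paper's module).

    Depth and flow features are turned into gates in (0, 1); the gates
    re-weight the RGB features, and a residual term keeps the original
    RGB signal so the auxiliary streams can only add emphasis.
    """
    if not (len(rgb) == len(depth) == len(flow)):
        raise ValueError("all streams must have the same number of features")
    fused = []
    for r, d, f in zip(rgb, depth, flow):
        gate = 0.5 * (sigmoid(d) + sigmoid(f))  # average of the two auxiliary gates
        fused.append(r + r * gate)              # residual: RGB plus gated RGB
    return fused


if __name__ == "__main__":
    # Neutral auxiliary features (all zeros) give a gate of 0.5 everywhere.
    print(attend_rgb([1.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
```

In this sketch the auxiliary streams never replace the RGB features; they only scale them, which mirrors the summary's description of depth and optical flow as auxiliary modalities.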

The researchers conduct extensive experiments on both their new RDVS dataset and pseudo RGB-D video datasets. They demonstrate that DCTNet+ outperforms 17 state-of-the-art video salient object detection models and 14 RGB-D salient object detection models. Ablation studies highlight the contributions of the individual DCTNet+ modules and the importance of using realistic depth data.
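
This summary does not name the evaluation metrics used in the comparison. As a hedged illustration of how a predicted saliency map can be scored against a ground-truth mask, the sketch below computes mean absolute error and a thresholded F-measure, two metrics that are common in SOD work. The function names, the fixed 0.5 threshold, and the weighting of 0.3 are assumptions for illustration, not details confirmed by the paper.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], gt: Sequence[int]) -> float:
    """Average per-pixel absolute difference between prediction and mask."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred: Sequence[float], gt: Sequence[int],
              threshold: float = 0.5, beta_sq: float = 0.3) -> float:
    """F-measure of the binarized prediction (threshold and beta_sq are assumed values)."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    binary = [1 if p >= threshold else 0 for p in pred]
    true_pos = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    if true_pos == 0:
        return 0.0  # no overlap: precision or recall is zero (or undefined)
    precision = true_pos / sum(binary)
    recall = true_pos / sum(gt)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    # A flattened 2x2 saliency map and its binary ground-truth mask.
    prediction = [0.9, 0.2, 0.8, 0.1]
    mask = [1, 0, 0, 1]
    print(mean_absolute_error(prediction, mask))
    print(f_measure(prediction, mask))
```

Lower mean absolute error and higher F-measure indicate a better match; a full benchmark would average such scores over every annotated frame in the test split.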

Critical Analysis

The paper makes a strong contribution by addressing the under-explored area of salient object detection in RGB-D videos. The new RDVS dataset provides a valuable resource for future research, filling an important gap in the available RGB-D video benchmarks.

While the DCTNet+ model shows impressive performance, the paper could have delved deeper into the model's limitations and potential failure cases. For example, how well does DCTNet+ generalize to more challenging or atypical scenes beyond the RDVS dataset? Additionally, the paper does not discuss potential computational or memory efficiency concerns of the model's complex architecture.

Furthermore, the research could be strengthened by incorporating more analysis on the model's learned features and attention mechanisms. Understanding how the different modalities (RGB, depth, flow) are weighted and combined by DCTNet+ could yield valuable insights for future work in this area.

Despite these minor shortcomings, this paper represents a significant advancement in RGB-D video salient object detection. The RDVS dataset and the DCTNet+ model provide a strong foundation for continued progress in this emerging field of computer vision.

Conclusion

This paper makes important contributions to the field of salient object detection in RGB-D videos, a promising but under-explored area of computer vision. By introducing the diverse RDVS dataset and the advanced DCTNet+ model, the researchers have significantly advanced the state-of-the-art in this domain.

The RDVS dataset provides a high-quality benchmark for evaluating RGB-D video salient object detection models, addressing limitations of previous datasets. The DCTNet+ model demonstrates strong performance by effectively integrating color, depth, and motion features through innovative attention and fusion modules.

These advancements in RGB-D video salient object detection could benefit a wide range of applications such as autonomous driving, video surveillance, and image/video editing. As depth-sensing technologies continue to proliferate, the techniques developed in this paper will become increasingly valuable for accurately identifying the most important visual elements in complex, dynamic scenes.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection

Junhao Lin, Lei Zhu, Jiaxing Shen, Huazhu Fu, Qing Zhang, Liansheng Wang

With the rapid development of depth sensors, more and more RGB-D videos could be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the collaboration of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos within a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated with a high-quality saliency annotation. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at github.com/jhl-Det/RGBD_Video_SOD.

6/19/2024

cs.CV

Unified Unsupervised Salient Object Detection via Knowledge Transfer

Yao Yuan, Wutao Liu, Pan Gao, Qun Dai, Jie Qin

Recently, unsupervised salient object detection (USOD) has gained increasing attention due to its annotation-free nature. However, current methods mainly focus on specific tasks such as RGB and RGB-D, neglecting the potential for task migration. In this paper, we propose a unified USOD framework for generic USOD tasks. Firstly, we propose a Progressive Curriculum Learning-based Saliency Distilling (PCL-SD) mechanism to extract saliency cues from a pre-trained deep network. This mechanism starts with easy samples and progressively moves towards harder ones, to avoid initial interference caused by hard samples. Afterwards, the obtained saliency cues are utilized to train a saliency detector, and we employ a Self-rectify Pseudo-label Refinement (SPR) mechanism to improve the quality of pseudo-labels. Finally, an adapter-tuning method is devised to transfer the acquired saliency knowledge, leveraging shared knowledge to attain superior transferring performance on the target tasks. Extensive experiments on five representative SOD tasks confirm the effectiveness and feasibility of our proposed method. Code and supplement materials are available at https://github.com/I2-Multimedia-Lab/A2S-v3.

4/24/2024

cs.CV

Quality-aware Selective Fusion Network for V-D-T Salient Object Detection

Liuxin Bao, Xiaofei Zhou, Xiankai Lu, Yaoqi Sun, Haibing Yin, Zhenghui Hu, Jiyong Zhang, Chenggang Yan

Depth images and thermal images contain the spatial geometry information and surface temperature information, which can act as complementary information for the RGB modality. However, the quality of the depth and thermal images is often unreliable in some challenging scenarios, which will result in the performance degradation of the two-modal based salient object detection (SOD). Meanwhile, some researchers pay attention to the triple-modal SOD task, where they attempt to explore the complementarity of the RGB image, the depth image, and the thermal image. However, existing triple-modal SOD methods fail to perceive the quality of depth maps and thermal images, which leads to performance degradation when dealing with scenes with low-quality depth and thermal images. Therefore, we propose a quality-aware selective fusion network (QSF-Net) to conduct VDT salient object detection, which contains three subnets including the initial feature extraction subnet, the quality-aware region selection subnet, and the region-guided selective fusion subnet. Firstly, except for extracting features, the initial feature extraction subnet can generate a preliminary prediction map from each modality via a shrinkage pyramid architecture. Then, we design the weakly-supervised quality-aware region selection subnet to generate the quality-aware maps. Concretely, we first find the high-quality and low-quality regions by using the preliminary predictions, which further constitute the pseudo label that can be used to train this subnet. Finally, the region-guided selective fusion subnet purifies the initial features under the guidance of the quality-aware maps, and then fuses the triple-modal features and refines the edge details of prediction maps through the intra-modality and inter-modality attention (IIA) module and the edge refinement (ER) module, respectively. Extensive experiments are performed on VDT-2048

5/14/2024

cs.CV

Salient Object Detection From Arbitrary Modalities

Nianchang Huang, Yang Yang, Ruida Xi, Qiang Zhang, Jungong Han, Jin Huang

Toward desirable saliency prediction, the types and numbers of inputs for a salient object detection (SOD) algorithm may dynamically change in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of inputs, failing to be generalized to other types of inputs. Consequentially, more types of SOD algorithms need to be prepared in advance for handling different types of inputs, raising huge hardware and research costs. Differently, in this paper, we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristics of AM SOD are that the modality types and modality numbers will be arbitrary or dynamically changed. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depths, or even any combination of them. While, the latter indicates that the inputs may have arbitrary modality numbers as the input type is changed, e.g. single-modality RGB image, dual-modality RGB-Depth (RGB-D) images or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e. a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing some modality indicators, which will generate some weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection.

5/10/2024

cs.CV