Video Instance Shadow Detection

Read original: arXiv:2211.12827 - Published 5/7/2024 by Zhenghao Xing, Tianyu Wang, Xiaowei Hu, Haoran Wu, Chi-Wing Fu, Pheng-Ann Heng

Overview

Instance shadow detection is crucial for applications like photo editing and light direction estimation.
Significant advancements have been made in predicting shadow instances, object instances, and their associations.
Extending this task to videos presents challenges in annotating diverse data and addressing complexities like occlusion and temporary disappearances.
To address these challenges, the authors introduce ViShadow, a semi-supervised video instance shadow detection framework.

Plain English Explanation

ViShadow is a system that detects shadows and the objects that cast them in videos, pairs each shadow with its object, and tracks those pairs from frame to frame. This is important for applications like photo editing, where you might want to remove or adjust shadows, or for estimating the direction of light in a scene.

The key innovation of ViShadow is that it uses both labeled image data and unlabeled video data to train the system. The first stage uses the labeled image data to identify shadow and object instances through a technique called contrastive learning, which helps the system learn to pair shadows and objects across frames.

The second stage then takes the unlabeled video data and uses a technique called associated cycle consistency loss to further improve the system's ability to track the shadows and objects as they move through the video. This helps it handle challenges like when objects or shadows disappear temporarily due to occlusion.

The researchers also introduced a new dataset called SOBA-VID, which includes unlabeled training videos and labeled testing videos, as well as a new metric called SOAP-VID for evaluating video instance shadow detection solutions.
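This summary does not say how SOAP-VID is computed. Video-level metrics of this kind are often built on a spatio-temporal IoU, which compares a predicted mask track with a ground-truth track over all frames at once instead of frame by frame. The sketch below shows only that building block, and whether SOAP-VID uses exactly this quantity is an assumption. The function name and the mask representation (one set of pixel coordinates per frame) are hypothetical.

```python
def spatio_temporal_iou(pred_track, gt_track):
    """IoU accumulated over a whole video.

    Each track is a list with one entry per frame; every entry is a set of
    (row, col) pixels, and an empty set means the instance is absent.
    """
    intersection = sum(len(p & g) for p, g in zip(pred_track, gt_track))
    union = sum(len(p | g) for p, g in zip(pred_track, gt_track))
    return intersection / union if union else 0.0


# Toy example (made-up masks): the prediction misses the instance in frame 1.
gt = [{(0, 0), (0, 1)}, {(0, 1), (0, 2)}, {(0, 2), (0, 3)}]
pred = [{(0, 0), (0, 1)}, set(), {(0, 2), (0, 3)}]
score = spatio_temporal_iou(pred, gt)
```

Because the sums run over every frame, a track that is lost for part of the video is penalized even when its remaining frames are segmented perfectly.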

Finally, the researchers demonstrated the effectiveness of ViShadow through various video-level applications, such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.

Technical Explanation

ViShadow is a two-stage semi-supervised framework for video instance shadow detection. In the first stage, it utilizes labeled image data to identify shadow and object instances through contrastive learning, which learns to pair shadows and objects across frames.
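To make the idea of contrastive pairing concrete, here is a minimal sketch of an InfoNCE-style loss over instance embeddings. It is a generic formulation, not the authors' exact loss: the function names, the temperature value, and the toy two-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss, where anchors[i] should match candidates[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Hypothetical embeddings of two shadow-object pairs in two frames.
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b_aligned = [[0.9, 0.1], [0.1, 0.9]]  # same instance keeps its index
frame_b_swapped = [[0.1, 0.9], [0.9, 0.1]]  # instances are mixed up

loss_aligned = info_nce(frame_a, frame_b_aligned)
loss_swapped = info_nce(frame_a, frame_b_swapped)
```

The loss is small when each embedding is closest to its own counterpart and large when the pairing is confused, which is the signal that pulls matching instances together and pushes different instances apart.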

The second stage then employs unlabeled video data, incorporating an associated cycle consistency loss to enhance the system's tracking ability. This helps address challenges like occlusion and temporary disappearances of objects and shadows within the video.
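The intuition behind a cycle consistency loss can be sketched as follows: softly match every instance in one frame to the next frame, match back again, and penalize round trips that do not return to the starting instance. No labels are needed, which is why this kind of loss suits unlabeled video. This is a generic version under stated assumptions, not the paper's associated cycle consistency loss; all names, the temperature, and the toy embeddings are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transition(src, dst, temperature=0.1):
    """Row-stochastic matrix softly assigning each src embedding to dst."""
    return [
        softmax([sum(a * b for a, b in zip(u, v)) / temperature for v in dst])
        for u in src
    ]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def cycle_consistency_loss(emb_t, emb_t1, temperature=0.1):
    """Negative log-probability of returning to the start after t -> t+1 -> t."""
    forward = transition(emb_t, emb_t1, temperature)
    backward = transition(emb_t1, emb_t, temperature)
    round_trip = matmul(forward, backward)
    n = len(emb_t)
    return -sum(math.log(round_trip[i][i] + 1e-12) for i in range(n)) / n


# Hypothetical instance embeddings in two consecutive frames.
emb_t = [[1.0, 0.0], [0.0, 1.0]]
loss_consistent = cycle_consistency_loss(emb_t, [[0.9, 0.1], [0.1, 0.9]])
loss_ambiguous = cycle_consistency_loss(emb_t, [[0.5, 0.5], [0.5, 0.5]])
```

When the next-frame embeddings are indistinguishable, the round trip lands on each instance with probability 0.5 and the loss equals ln 2; distinct, stable embeddings drive it toward zero.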

To manage temporary disappearances and ensure tracking continuity, the researchers introduced a retrieval mechanism. They also created a new dataset, SOBA-VID, which includes unlabeled training videos and labeled testing videos, as well as a new evaluation metric, SOAP-VID.
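One common way to build such a retrieval step is to keep the last embedding seen for every track and, when a detection appears, reuse the stored identity whose embedding is most similar. The sketch below illustrates that idea only; the summary does not describe the authors' mechanism in detail, so the `TrackMemory` class, the similarity threshold, and the toy embeddings are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


class TrackMemory:
    """Hypothetical memory bank mapping track IDs to their last embedding."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.bank = {}
        self.next_id = 0

    def assign(self, embedding, active_ids=()):
        """Return a track ID, reusing a stored one when it is similar enough."""
        best_id, best_sim = None, self.threshold
        for track_id, stored in self.bank.items():
            if track_id in active_ids:
                continue  # already matched to another detection in this frame
            sim = cosine(embedding, stored)
            if sim >= best_sim:
                best_id, best_sim = track_id, sim
        if best_id is None:  # nothing similar enough: start a new track
            best_id = self.next_id
            self.next_id += 1
        self.bank[best_id] = embedding
        return best_id


memory = TrackMemory()
first = memory.assign([1.0, 0.0])                       # new track
other = memory.assign([0.0, 1.0], active_ids={first})   # second new track
# ... frames in which the first shadow-object pair is occluded ...
reappeared = memory.assign([0.95, 0.05])                # retrieved from memory
```

Because identities are looked up in the memory rather than only in the previous frame, a pair that vanishes for several frames can recover its original ID when it reappears.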

The effectiveness of ViShadow is demonstrated through various video-level applications, such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.

Critical Analysis

The paper introduces a novel approach to video instance shadow detection, but it does not address certain limitations. For example, the system may struggle with complex scenes with multiple overlapping shadows or in cases where shadows are cast by transparent or reflective objects.

Additionally, the authors mention that the SOBA-VID dataset is limited in diversity, and the proposed SOAP-VID metric may not capture all relevant aspects of video instance shadow detection performance.

Further research could explore ways to improve the system's robustness to these types of challenges, as well as investigate the potential of self-supervised or unsupervised learning techniques to reduce the reliance on labeled data.

Conclusion

ViShadow is a step forward in video instance shadow detection, a task that supports applications like photo editing and light direction estimation. By combining labeled image data with unlabeled video data, the system learns to identify and track shadows and objects across video scenes without requiring annotated training videos.

The introduction of the SOBA-VID dataset and SOAP-VID metric provides a valuable resource for further research and development in this area. While the system has some limitations, the insights and techniques presented in this work have the potential to drive continued advancements in video-based shadow detection and manipulation, ultimately enhancing the capabilities of various multimedia applications.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

Video Instance Shadow Detection

Zhenghao Xing, Tianyu Wang, Xiaowei Hu, Haoran Wu, Chi-Wing Fu, Pheng-Ann Heng

Instance shadow detection, crucial for applications such as photo editing and light direction estimation, has undergone significant advancements in predicting shadow instances, object instances, and their associations. The extension of this task to videos presents challenges in annotating diverse video data and addressing complexities arising from occlusion and temporary disappearances within associations. In response to these challenges, we introduce ViShadow, a semi-supervised video instance shadow detection framework that leverages both labeled image data and unlabeled video data for training. ViShadow features a two-stage training pipeline: the first stage, utilizing labeled image data, identifies shadow and object instances through contrastive learning for cross-frame pairing. The second stage employs unlabeled videos, incorporating an associated cycle consistency loss to enhance tracking ability. A retrieval mechanism is introduced to manage temporary disappearances, ensuring tracking continuity. The SOBA-VID dataset, comprising unlabeled training videos and labeled testing videos, along with the SOAP-VID metric, is introduced for the quantitative evaluation of VISD solutions. The effectiveness of ViShadow is further demonstrated through various video-level applications such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.

5/7/2024

Language-Driven Interactive Shadow Detection

Hongqiu Wang, Wei Wang, Haipeng Zhou, Huihui Xu, Shaozhi Wu, Lei Zhu

Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at https://github.com/whq-xxh/RVSD.

8/19/2024

Unveiling Deep Shadows: A Survey on Image and Video Shadow Detection, Removal, and Generation in the Era of Deep Learning

Xiaowei Hu, Zhenghao Xing, Tianyu Wang, Chi-Wing Fu, Pheng-Ann Heng

Shadows are formed when light encounters obstacles, leading to areas of diminished illumination. In computer vision, shadow detection, removal, and generation are crucial for enhancing scene understanding, refining image quality, ensuring visual consistency in video editing, and improving virtual environments. This paper presents a comprehensive survey of shadow detection, removal, and generation in images and videos within the deep learning landscape over the past decade, covering tasks, deep models, datasets, and evaluation metrics. Our key contributions include a comprehensive survey of shadow analysis, standardization of experimental comparisons, exploration of the relationships among model size, speed, and performance, a cross-dataset generalization study, identification of open issues and future directions, and provision of publicly available resources to support further research.

9/4/2024

PM-VIS: High-Performance Box-Supervised Video Instance Segmentation

Zhangjing Yang, Dun Liu, Wensheng Cheng, Jinqiao Wang, Yi Wu

Labeling pixel-wise object masks in videos is a resource-intensive and laborious process. Box-supervised Video Instance Segmentation (VIS) methods have emerged as a viable solution to mitigate the labor-intensive annotation process. In practical applications, the two-step approach is not only more flexible but also exhibits a higher recognition accuracy. Inspired by the recent success of Segment Anything Model (SAM), we introduce a novel approach that aims at harnessing instance box annotations from multiple perspectives to generate high-quality instance pseudo masks, thus enriching the information contained in instance annotations. We leverage ground-truth boxes to create three types of pseudo masks using the HQ-SAM model, the box-supervised VIS model (IDOL-BoxInst), and the VOS model (DeAOT) separately, along with three corresponding optimization mechanisms. Additionally, we introduce two ground-truth data filtering methods, assisted by high-quality pseudo masks, to further enhance the training dataset quality and improve the performance of fully supervised VIS methods. To fully capitalize on the obtained high-quality Pseudo Masks, we introduce a novel algorithm, PM-VIS, to integrate mask losses into IDOL-BoxInst. Our PM-VIS model, trained with high-quality pseudo mask annotations, demonstrates strong ability in instance mask prediction, achieving state-of-the-art performance on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS validation sets, notably narrowing the gap between box-supervised and fully supervised VIS methods.

4/23/2024