DeS3: Adaptive Attention-driven Self and Soft Shadow Removal using ViT Similarity

Read original: arXiv:2211.08089 - Published 4/16/2024 by Yeying Jin, Wei Ye, Wenhan Yang, Yuan Yuan, Robby T. Tan

Overview

Removing soft and self shadows from a single image is a challenging task.
Self shadows are shadows cast on the object itself.
Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows.
This paper presents DeS3, a method that removes hard, soft, and self shadows based on adaptive attention and ViT similarity.
The novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer to help guide the reverse sampling towards recovering scene structures.
The adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow.
This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows.
Unlike existing methods that rely on constraints during the training phase, DeS3 incorporates the ViT similarity during the sampling stage.

Plain English Explanation

Removing shadows from a single image is hard, especially when the shadows are soft or fall on the object that casts them (self shadows). Most existing methods mark shadow regions with a binary mask, which works poorly when a shadow has no clear boundary, as is the case for the blurry edges of soft shadows and for self shadows.
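To make the binary-mask problem concrete, here is a minimal toy sketch on a single row of pixels. It is not taken from the paper: the linear shadow ramp, the 0.75 threshold, the constant gain, and all function names are illustrative assumptions. It shows that brightening every masked pixel by one gain fixes the darkest pixel but leaves large errors across the gradual transition, whereas a per-pixel (soft) correction does not.

```python
# Toy 1-D scanline: a surface with albedo 1.0 under a soft shadow.
# All numbers and names are illustrative assumptions, not values from the paper.

def soft_shadow_profile(n=11, darkest=0.4):
    """Attenuation per pixel: fully lit (1.0) ramping linearly down to `darkest`."""
    return [1.0 - (1.0 - darkest) * i / (n - 1) for i in range(n)]

def binary_mask(attenuation, threshold=0.75):
    """Hard shadow / non-shadow decision, as a binary mask would encode it."""
    return [1 if a < threshold else 0 for a in attenuation]

def relight_with_binary_mask(image, mask, gain):
    """Brighten every masked pixel by the same constant gain."""
    return [p * gain if m else p for p, m in zip(image, mask)]

attenuation = soft_shadow_profile()
shadowed = [1.0 * a for a in attenuation]  # albedo 1.0 everywhere

# Binary-mask correction: one gain chosen to fix the darkest pixel.
mask = binary_mask(attenuation)
restored = relight_with_binary_mask(shadowed, mask, gain=1.0 / 0.4)
errors = [abs(r - 1.0) for r in restored]  # shadow-free target is 1.0

# Soft correction: divide by the (here, known) per-pixel attenuation.
restored_soft = [p / a for p, a in zip(shadowed, attenuation)]

print("binary-mask errors:", [round(e, 2) for e in errors])
print("soft-correction errors:", [round(abs(r - 1.0), 2) for r in restored_soft])
```

In this toy setting the binary-mask errors are zero only at the two ends of the ramp. In a real image the per-pixel attenuation is unknown, which is the part a learned method has to estimate.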

The DeS3 method proposed in this paper uses a more advanced approach to handle these types of shadows. It employs a ViT similarity loss, which uses features extracted from a pre-trained Vision Transformer model to help the system understand the structure of the scene and recover details that were hidden by the shadows.

The system also uses an "adaptive attention" mechanism that can differentiate between the shadow regions, the objects casting the shadows, and the objects being shadowed. This allows it to do a better job of recovering the true structure of the objects, even when they are partially obscured by shadows.

Unlike previous methods that rely on constraints applied during training, DeS3 applies the ViT similarity loss during the sampling stage, that is, while the shadow-free image is being generated. According to the authors, this helps it remove hard, soft, and self shadows robustly.

Technical Explanation

The DeS3 method proposed in this paper uses a combination of adaptive attention and a novel ViT similarity loss to remove hard, soft, and self shadows from a single input image.

The adaptive attention mechanism allows DeS3 to differentiate between shadow regions, the objects casting the shadows, and the objects being shadowed. This helps it better recover the underlying structure of the scene, even when objects are partially occluded by shadows.

The ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer model to guide the reverse sampling process towards recovering the true scene structure. This is in contrast to previous methods that relied on constraints applied during the training phase.
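The idea of steering an iterative sampler with a feature-based loss can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: `toy_features` is a hypothetical stand-in for a frozen pre-trained ViT, a squared feature distance stands in for whatever similarity measure the paper uses, the gradient is taken by finite differences, and the sampler's own denoising update is omitted so that only the guidance nudge remains.

```python
def toy_features(image):
    """Hypothetical stand-in for a frozen, pre-trained ViT feature extractor.

    Returns differences between neighbouring pixels, a crude structure
    descriptor that ignores constant brightness offsets.
    """
    return [b - a for a, b in zip(image, image[1:])]

def feature_distance(estimate, reference):
    """Mean squared distance between the two feature vectors (assumed loss)."""
    fe = toy_features(estimate)
    fr = toy_features(reference)
    return sum((a - b) ** 2 for a, b in zip(fe, fr)) / len(fe)

def numerical_gradient(loss_fn, x, h=1e-5):
    """Central-difference gradient of loss_fn at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grad

def guided_refinement(estimate, reference, steps=50, step_size=0.2):
    """Nudge the estimate down the loss gradient at every sampling step.

    A real sampler would interleave its own update with this nudge;
    that update is left out of this sketch.
    """
    x = list(estimate)
    for _ in range(steps):
        grad = numerical_gradient(lambda z: feature_distance(z, reference), x)
        x = [xi - step_size * gi for xi, gi in zip(x, grad)]
    return x

# Illustrative 1-D "images": the reference carries the structure to preserve.
reference = [0.2, 0.2, 0.8, 0.8, 0.2]
estimate = [0.5, 0.4, 0.6, 0.5, 0.5]

loss_before = feature_distance(estimate, reference)
refined = guided_refinement(estimate, reference)
loss_after = feature_distance(refined, reference)
print(f"feature distance before: {loss_before:.4f}, after: {loss_after:.6f}")
```

The design point the sketch tries to capture is that the loss acts on the sample at inference time, so no retraining is needed to apply it. Which image serves as the reference and how the features are compared are details to check against the paper and its code.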

The authors evaluated DeS3 on the SRD, AISTD, LRSS, USR, and UIUC datasets and report that it outperforms state-of-the-art methods on all five. On the LRSS dataset specifically, it improves on the previous best method by 16% in RMSE measured over the whole image.
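For readers unfamiliar with the metric, the sketch below computes RMSE and a relative improvement. It assumes that the 16% figure is a relative reduction in RMSE, which the summary does not spell out, and the pixel values and the two RMSE scores are made-up numbers chosen only to reproduce a 16% reduction. The paper's exact evaluation protocol (color space, image size) is not reproduced here.

```python
import math

def rmse(pred, target):
    """Root mean square error between two equally sized pixel lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def relative_improvement(baseline_rmse, new_rmse):
    """Fraction by which new_rmse is lower than baseline_rmse."""
    return (baseline_rmse - new_rmse) / baseline_rmse

# Hypothetical pixel values: every prediction is off by 0.1.
example_rmse = rmse([0.6, 0.4, 0.6, 0.4], [0.5, 0.5, 0.5, 0.5])

# Hypothetical scores, not results from the paper.
improvement = relative_improvement(baseline_rmse=0.100, new_rmse=0.084)
print(f"example RMSE: {example_rmse:.3f}")
print(f"relative improvement: {improvement:.0%}")
```

Because lower RMSE is better, a positive value from `relative_improvement` means the new method has the smaller error.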

Critical Analysis

The authors acknowledge that while DeS3 outperforms existing methods, there is still room for improvement, especially when handling complex shadow interactions and recovering fine details. They suggest that incorporating additional co-supervision signals or using more advanced vision transformer architectures could further improve the method's performance.

One potential limitation of the approach is its reliance on the pre-trained ViT model, which may not be fully optimized for the specific task of shadow removal. The authors do not provide a detailed analysis of how the choice of ViT model or its training affects the overall performance of DeS3.

Additionally, the paper does not address the potential for the method to introduce artifacts or distortions in the recovered image, which could be a concern for applications where visual fidelity is crucial, such as image editing.

Conclusion

The DeS3 method presented in this paper advances single-image shadow removal. By combining adaptive attention with a novel ViT similarity loss applied during sampling, it removes hard, soft, and self shadows while better preserving the underlying structure of the scene.

The authors' evaluation on multiple benchmark datasets demonstrates the effectiveness of their approach, particularly in challenging scenarios with complex shadow interactions. While there is still room for improvement, DeS3 represents an important step towards more robust and versatile shadow removal techniques that could have valuable applications in areas like computational photography, image editing, and scene understanding.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!

Related Papers

DeS3: Adaptive Attention-driven Self and Soft Shadow Removal using ViT Similarity

Yeying Jin, Wei Ye, Wenhan Yang, Yuan Yuan, Robby T. Tan

Removing soft and self shadows that lack clear boundaries from a single image is still challenging. Self shadows are shadows that are cast on the object itself. Most existing methods rely on binary shadow masks, without considering the ambiguous boundaries of soft and self shadows. In this paper, we present DeS3, a method that removes hard, soft and self shadows based on adaptive attention and ViT similarity. Our novel ViT similarity loss utilizes features extracted from a pre-trained Vision Transformer. This loss helps guide the reverse sampling towards recovering scene structures. Our adaptive attention is able to differentiate shadow regions from the underlying objects, as well as shadow regions from the object casting the shadow. This capability enables DeS3 to better recover the structures of objects even when they are partially occluded by shadows. Different from existing methods that rely on constraints during the training phase, we incorporate the ViT similarity during the sampling stage. Our method outperforms state-of-the-art methods on the SRD, AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows robustly. Specifically, our method outperforms the SOTA method by 16% of the RMSE of the whole image on the LRSS dataset. Our data and code are available at: https://github.com/jinyeying/DeS3_Deshadow

4/16/2024

Soft-Hard Attention U-Net Model and Benchmark Dataset for Multiscale Image Shadow Removal

Eirini Cholopoulou, Dimitrios E. Diamantis, Dimitra-Christina C. Koutsiou, Dimitris K. Iakovidis

Effective shadow removal is pivotal in enhancing the visual quality of images in various applications, ranging from computer vision to digital photography. During the last decades, physics- and machine learning-based methodologies have been proposed; however, most of them have limited capacity in capturing complex shadow patterns due to restrictive model assumptions, neglecting the fact that shadows usually appear at different scales. Also, current datasets used for benchmarking shadow removal are composed of a limited number of images with simple scenes containing mainly uniform shadows cast by single objects, whereas only a few of them include both manual shadow annotations and paired shadow-free images. Aiming to address all these limitations in the context of natural scene imaging, including urban environments with complex scenes, the contribution of this study is twofold: a) it proposes a novel deep learning architecture, named Soft-Hard Attention U-net (SHAU), focusing on multiscale shadow removal; b) it provides a novel synthetic dataset, named Multiscale Shadow Removal Dataset (MSRD), containing complex shadow patterns of multiple scales, aiming to serve as a privacy-preserving dataset for a more comprehensive benchmarking of future shadow removal methodologies. Key architectural components of SHAU are the soft and hard attention modules, which along with multiscale feature extraction blocks enable effective shadow removal of different scales and intensities. The results demonstrate the effectiveness of SHAU over the relevant state-of-the-art shadow removal methods across various benchmark datasets, improving the Peak Signal-to-Noise Ratio and Root Mean Square Error for the shadow area by 25.1% and 61.3%, respectively.

8/9/2024

Language-Driven Interactive Shadow Detection

Hongqiu Wang, Wei Wang, Haipeng Zhou, Huihui Xu, Shaozhi Wu, Lei Zhu

Traditional shadow detectors often identify all shadow regions of static images or video sequences. This work presents the Referring Video Shadow Detection (RVSD), which is an innovative task that rejuvenates the classic paradigm by facilitating the segmentation of particular shadows in videos based on descriptive natural language prompts. This novel RVSD not only achieves segmentation of arbitrary shadow areas of interest based on descriptions (flexibility) but also allows users to interact with visual content more directly and naturally by using natural language prompts (interactivity), paving the way for abundant applications ranging from advanced video editing to virtual reality experiences. To pioneer the RVSD research, we curated a well-annotated RVSD dataset, which encompasses 86 videos and a rich set of 15,011 paired textual descriptions with corresponding shadows. To the best of our knowledge, this dataset is the first one for addressing RVSD. Based on this dataset, we propose a Referring Shadow-Track Memory Network (RSM-Net) for addressing the RVSD task. In our RSM-Net, we devise a Twin-Track Synergistic Memory (TSM) to store intra-clip memory features and hierarchical inter-clip memory features, and then pass these memory features into a memory read module to refine features of the current video frame for referring shadow detection. We also develop a Mixed-Prior Shadow Attention (MSA) to utilize physical priors to obtain a coarse shadow map for learning more visual features by weighting it with the input video frame. Experimental results show that our RSM-Net achieves state-of-the-art performance for RVSD with a notable Overall IOU increase of 4.4%. Our code and dataset are available at https://github.com/whq-xxh/RVSD.

8/19/2024

DocDeshadower: Frequency-Aware Transformer for Document Shadow Removal

Ziyang Zhou, Yingtie Lei, Xuhang Chen, Shenghong Luo, Wenjun Zhang, Chi-Man Pun, Zhen Wang

Shadows in scanned documents pose significant challenges for document analysis and recognition tasks due to their negative impact on visual quality and readability. Current shadow removal techniques, including traditional methods and deep learning approaches, face limitations in handling varying shadow intensities and preserving document details. To address these issues, we propose DocDeshadower, a novel multi-frequency Transformer-based model built upon the Laplacian Pyramid. By decomposing the shadow image into multiple frequency bands and employing two critical modules, the Attention-Aggregation Network for low-frequency shadow removal and the Gated Multi-scale Fusion Transformer for global refinement, DocDeshadower effectively removes shadows at different scales while preserving document content. Extensive experiments demonstrate DocDeshadower's superior performance compared to state-of-the-art methods, highlighting its potential to significantly improve document shadow removal techniques. The code is available at https://github.com/leiyingtie/DocDeshadower.

7/31/2024