DerainNeRF: 3D Scene Estimation with Adhesive Waterdrop Removal

Read original: arXiv:2403.20013 - Published 4/1/2024 by Yunhao Li, Jing Wu, Lingzhe Zhao, Peidong Liu

INTRODUCTION

The paper discusses the importance of 3D scene representation and estimation techniques in various applications, such as autonomous driving, robotics, virtual reality (VR), and cultural heritage preservation. Neural Radiance Fields (NeRF) have gained popularity for 3D reconstruction and scene representation due to their ability to provide continuous scene representation, handle complex scenes, and achieve state-of-the-art performance in novel-view image synthesis.

However, in real-world scenarios involving outdoor images, particularly those captured during rainy or snowy conditions, the presence of adhesive raindrops can degrade the performance of related applications like 3D reconstruction, visual perception, object detection, and tracking. To address this limitation, the paper proposes a method to simultaneously remove the adhesive raindrops from captured images and recover the underlying 3D scene implicitly, leveraging the impressive 3D scene representation capability of NeRF.

Figure 1: Given a set of waterdrop images (left column), our DerainNeRF estimates 3D scenes and removes the adhesive waterdrops altogether. It synthesizes clear images (right column) with high quality.

The paper proposes DerainNeRF, a NeRF-based framework that simultaneously estimates 3D scenes while removing waterdrops from images. The vanilla NeRF framework is not robust to images with adhesive waterdrops, which have random spatial distribution, irregular shapes, and complicated refraction and reflection properties.

The proposed method combines a waterdrop detection network and NeRF for 3D scene representation learning. It first uses a pre-trained deep waterdrop detector to predict the locations of waterdrops. Then, it excludes the waterdrop-covered pixels during the training of NeRF, allowing it to recover clear scenes from non-occluded pixels.

DerainNeRF is evaluated using synthetic and real datasets. The experimental results demonstrate that it effectively estimates clear 3D scenes from waterdrop images and renders novel-view clear images. Both quantitative and qualitative results show that DerainNeRF delivers superior quality compared to existing state-of-the-art image waterdrop removal methods. It is the first NeRF-based method that takes waterdrop-degraded images as input and recovers the clear scene implicitly.

RELATED WORK

The paper reviews prior work in two main areas related to the proposed method: Neural Radiance Fields (NeRF) and image adhesive waterdrop removal.

On NeRF:

NeRF is a technique for 3D scene representation and novel view synthesis using a neural network to model the scene's volumetric radiance field.
Many variants of NeRF have been proposed to handle large-scale scenes, challenging imaging conditions, high dynamic range, and scene editing.
Some methods extend NeRF to handle refractive objects like water drops by modeling light refraction.
Prior work on occlusion removal with NeRF either requires user masking or additional networks to separate backgrounds.

On image adhesive waterdrop removal:

Early methods modeled waterdrop geometry, refraction, and reflection properties.
Other approaches used temporal or spatial features to separate waterdrops from backgrounds.
Recent deep learning methods use convolutional neural networks, attention mechanisms, and adversarial training for waterdrop removal.
However, existing methods have limitations in handling large waterdrop areas or leveraging global spatial information.

METHOD

Figure 2: Training procedure of DerainNeRF. A pre-trained deep waterdrop detector detects waterdrops in input images and generates binary masks, then DerainNeRF utilizes the masks to block waterdrop regions in input images during NeRF training.

The paper proposes a method called DerainNeRF to remove waterdrops from multi-view images. It handles two scenarios: waterdrops fixed in the scene while the camera moves, and waterdrops static relative to the camera lens.

The method first uses the AttGAN model to detect image regions covered by waterdrops. It then excludes those regions from the training of NeRF, a technique for estimating 3D scene geometry and appearance from multi-view images.

For each input image, a binary mask is generated from the AttGAN attention map indicating waterdrop probability. NeRF is trained by masking the photometric loss for pixels covered by waterdrops according to this mask.
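
The masked loss described above can be sketched as follows. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the flattened per-pixel layout, and the 0.5 threshold are assumptions made for the example, and the summary does not state how the attention map is binarized.

```python
from typing import List, Sequence

RGB = Sequence[float]


def attention_to_mask(attention: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Binarize per-pixel waterdrop probabilities: 1 = waterdrop, 0 = clear.

    The threshold value is an ASSUMPTION for illustration.
    """
    return [1 if a >= threshold else 0 for a in attention]


def masked_photometric_loss(
    rendered: Sequence[RGB],
    observed: Sequence[RGB],
    drop_mask: Sequence[int],
) -> float:
    """Mean squared RGB error over pixels that are NOT flagged as waterdrops."""
    total = 0.0
    kept = 0
    for pred, target, is_drop in zip(rendered, observed, drop_mask):
        if is_drop:
            continue  # waterdrop-covered pixels contribute nothing to the loss
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
        kept += 1
    # Average over the kept pixels and their three color channels.
    return total / (3 * kept) if kept else 0.0


attention = [0.1, 0.9, 0.2]
mask = attention_to_mask(attention)
rendered = [(0.5, 0.5, 0.5), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
observed = [(0.5, 0.5, 0.5), (1.0, 1.0, 1.0), (0.5, 0.5, 0.5)]
loss = masked_photometric_loss(rendered, observed, mask)
```

In this toy input the second pixel has a large color error, but it is flagged as a waterdrop and therefore ignored, so only the two clear pixels drive the loss.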

When waterdrops adhere to the camera lens, an additional mask is computed by averaging the attention maps over multiple frames. This accounts for areas where some waterdrops were missed due to over/underexposure in individual frames.
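
A rough sketch of this mask enhancement step is shown below. It assumes that attention maps are flattened lists of per-pixel probabilities, that the averaged map is binarized with the same hypothetical 0.5 threshold as the per-frame maps, and that the shared mask is combined with each per-frame mask by a logical OR. The summary does not specify the threshold or the combination rule, so both are assumptions.

```python
from typing import List, Sequence


def average_attention(attention_maps: Sequence[Sequence[float]]) -> List[float]:
    """Per-pixel mean of the attention maps across all frames."""
    n = len(attention_maps)
    return [sum(pixel_values) / n for pixel_values in zip(*attention_maps)]


def enhance_masks(
    attention_maps: Sequence[Sequence[float]],
    threshold: float = 0.5,  # ASSUMPTION: not given in the summary
) -> List[List[int]]:
    """Return one binary mask per frame (1 = waterdrop).

    A pixel is masked if its own frame flags it OR if the frame-averaged
    attention flags it (ASSUMED combination rule).
    """
    shared = [a >= threshold for a in average_attention(attention_maps)]
    return [
        [int(a >= threshold or s) for a, s in zip(frame, shared)]
        for frame in attention_maps
    ]


# Three frames, three pixels each. The drop on pixel 0 is missed in the
# last frame (0.3) but recovered through the averaged attention map.
frames = [
    [0.9, 0.1, 0.2],
    [0.8, 0.1, 0.7],
    [0.3, 0.0, 0.1],
]
masks = enhance_masks(frames)
```

Because waterdrops on the lens stay at the same image location in every frame, a detection that is weak in one frame can be backed up by stronger detections in the others.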

After training, DerainNeRF can recover the clear 3D scene without waterdrops and render novel views from arbitrary camera poses.

EXPERIMENTS

This section covers the implementation details and experimental results of DerainNeRF for removing waterdrops from images and reconstructing clear 3D scenes. Key points are:

Implementation Details:

The pre-trained waterdrop detector from AttGAN is used to obtain binary masks.
The Adam optimizer is used with a decaying learning rate for DerainNeRF training.
Training runs for 200K iterations with 1024 rays per batch on an RTX 3090 GPU.
COLMAP is used for estimating camera poses from input images.
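
To make the training setup above concrete, here is a small sketch of a decaying learning-rate schedule. The iteration count and rays per batch come from the summary. The exponential form of the decay and the start and end learning rates (5e-4 and 5e-5, the values commonly used with the original NeRF) are assumptions, since the summary only says the rate decays. `TrainConfig` and `learning_rate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    iterations: int = 200_000    # from the summary
    rays_per_batch: int = 1024   # from the summary
    lr_init: float = 5e-4        # ASSUMPTION: common NeRF default
    lr_final: float = 5e-5       # ASSUMPTION: common NeRF default


def learning_rate(step: int, cfg: TrainConfig = TrainConfig()) -> float:
    """Exponentially interpolate from lr_init to lr_final over training."""
    progress = min(max(step / cfg.iterations, 0.0), 1.0)
    return cfg.lr_init * (cfg.lr_final / cfg.lr_init) ** progress


cfg = TrainConfig()
schedule = [learning_rate(s, cfg) for s in (0, 100_000, 200_000)]
```

The value returned for each step would be handed to the optimizer before the corresponding update. The schedule starts at `lr_init`, ends at `lr_final`, and decreases monotonically in between.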

Datasets:

Synthetic dataset from Blender scenes with simulated waterdrops, including fixed waterdrops and moving waterdrops scenarios.
Real indoor dataset captured with a camera and waterdrop-covered glass, including fixed glass and moving glass scenarios.
Real outdoor dataset from a vehicle in rainy conditions with waterdrops on the camera lens.

Results:

Qualitative and quantitative comparisons against NeRF, AttGAN, and other SOTA methods on synthetic and real datasets.
DerainNeRF outperforms other methods in PSNR and LPIPS metrics on the synthetic dataset.
DerainNeRF effectively removes waterdrops of various sizes and shapes on real indoor and outdoor datasets.
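
For reference, PSNR follows the standard definition 10 * log10(MAX^2 / MSE), where higher is better. The sketch below assumes images are flattened lists of intensities in [0, 1]. The function name and data layout are chosen for the example and are not taken from the paper. LPIPS is not shown because it relies on a learned network.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if len(pred) != len(target) or not pred:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives an MSE of 0.01, i.e. about 20 dB.
score = psnr([0.5, 0.5], [0.6, 0.4])
```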

Ablation Study:

Analyzes the effect of mask enhancement using average attention maps.
Mask enhancement improves the quality of synthesized images, especially with dense waterdrops.

CONCLUSIONS

The paper introduces DerainNeRF, a novel approach for 3D scene estimation from multi-view images degraded by waterdrops, built on the Neural Radiance Fields (NeRF) representation. The method removes waterdrops in two stages. First, a pre-trained waterdrop detector identifies and localizes waterdrops in the input images. Then, a NeRF is trained only on the non-occluded pixels to estimate the clear scene. The authors evaluate the method against existing state-of-the-art techniques for image waterdrop removal on both synthetic and real datasets, and the experimental results show that DerainNeRF outperforms those approaches.

This summary was produced with help from an AI and may contain inaccuracies - check out the links to read the original source documents!
