Stable Surface Regularization for Fast Few-Shot NeRF

2403.19985

Published 4/1/2024 by Byeongin Joung, Byeong-Uk Lee, Jaesung Choe, Ukcheol Shin, Minjun Kang, Taeyeop Lee, In So Kweon, Kuk-Jin Yoon

cs.CV

Abstract

This paper proposes an algorithm for synthesizing novel views under a few-shot setup. The main concept is to develop a stable surface regularization technique called Annealing Signed Distance Function (ASDF), which anneals the surface in a coarse-to-fine manner to accelerate convergence. We observe that the Eikonal loss, a widely known geometric regularization, requires a dense training signal to shape the different level sets of the SDF, leading to low-fidelity results under few-shot training. In contrast, the proposed surface regularization successfully reconstructs scenes and produces high-fidelity geometry with stable training. Our method is further accelerated by utilizing a grid representation and monocular geometric priors. Finally, the proposed approach is up to 45 times faster than existing few-shot novel view synthesis methods, and it produces comparable results on the ScanNet and NeRF-Real datasets.

Introduction

Neural Radiance Fields (NeRF) is a technique that uses Multi-Layer Perceptrons (MLPs) and a large number of training images to encode the appearance and geometry of a scene. NeRF has demonstrated its effectiveness in various applications, such as novel view synthesis, surface reconstruction, dynamic scene rendering, and lighting. However, building a neural implicit field with fewer input images is challenging, and the optimization process required for each scene can take over 10 hours, making it difficult to apply in real-world applications.

Figure 1: Our method can synthesize novel views within 30 minutes by utilizing multi-level voxel grid optimization. To overcome the limitation of novel view synthesis with sparse input images, we utilized additional strong geometric cues and our novel geometric smoothing loss, Annealing SDF loss.

Figure 2: Novel view synthesis comparisons with state-of-the-art works (a) DSNeRF [4] (b) DDPNeRF [21] (c) voxel grids based approach [24] with eikonal loss [8] and depth supervision, and (d) ours. Top row: color images, bottom row: depth maps.

The paper discusses strategies to overcome issues in few-shot setups, such as unobserved viewpoint regularization, entropy minimization, and geometric priors. It also presents voxel representation and hash encoding as methods for fast training in novel-view synthesis tasks. However, these previous approaches require long training times or struggle with sparse input views.

The paper's key contribution is the proposal of a novel method called Annealing Signed Distance Function (ASDF) loss, which addresses the limitations of the Eikonal loss when applied to few-shot NeRF training. The Eikonal loss can fail to capture reliable color and geometry information in such settings, especially in voxel-based environments with sparse inputs. The ASDF loss enforces strong geometric smoothing early in training, enabling coarse-to-fine surface regularization and stable convergence.

By incorporating the ASDF loss, along with dense 3D predictions and multi-view consistency, the proposed algorithm can successfully synthesize novel views from few-shot training, while maintaining fast training times. The paper claims the approach demonstrates comparable performance with 30-45 times faster training speeds compared to previous methods.

Figure 3: The overall pipeline. The proposed architecture utilizes structure from motion to extract sparse 3D information and camera poses from sparse input views, while off-the-shelf depth and surface normal are obtained from a pretrained network [5]. Points are sampled with respect to the camera from the structure from motion, and the feature value at corresponding points is extracted. SDF values, gradient of SDF values, and RGB values are decoded by simple MLP decoder, and RGB, surface normal, and depth values are extracted along the ray using volumetric rendering. The rendered RGB, depth map, and surface normal are supervised with their respective labels using the loss functions $L_C$, $L_D$, and $L_N$, while SDF values are supervised with the loss function $L_{\text{ASDF}}$, which is composed of $L_{\text{GS}}$ and $L_{\text{wEik}}$.

Related Works

The paper discusses the recent advancements in the field of neural implicit representation, which has led to the development of several approaches to enhance the capabilities of this representation. Specifically, the paper mentions that methods like multi-view consistency, optimization based on camera pose errors, and surface reconstruction have been used to improve the performance of neural implicit representation.

The paper then focuses on the limitations of NeRF, a neural radiance field approach, which requires dense inputs for high-quality novel view synthesis and shows limited performance with sparse inputs. To address this issue, the paper discusses the use of geometric constraints and regularization techniques, such as patch-based geometry and color regularization, ray entropy minimization, and the use of pseudo-geometry and color in unseen views.

The paper also mentions that additional constraints, such as geometric priors, have been adapted to NeRF to improve its performance. Specifically, the paper discusses the use of a monocular depth oracle network, depth priors in indoor multi-view stereo, and surface normal information to enhance the optimization and performance of NeRF.

Finally, the paper states that by utilizing full information from multi-view consistency and geometric cues from sparse input views with a strong geometric smoothing loss, the authors have successfully addressed the limitations of NeRF.

Preliminary

The paper first reviews NeRF, a pioneering technique that encodes the appearance and geometry of a scene in a multilayer perceptron (MLP) network to synthesize novel views. Given posed images, the MLP is trained to output color and density at a given query point, and the color of a ray is synthesized with differentiable volumetric rendering. Instead of using density, this paper learns a signed distance function (SDF) to represent the scene geometry. The SDF value and its gradient with respect to the query point are used to compute the opacity and the surface normal at each point, respectively.
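To make the step from SDF values to rendered quantities concrete, here is a minimal sketch in plain Python. It assumes a NeuS-style conversion from SDF to opacity through a logistic CDF, which is one common choice; this summary does not state the paper's exact formula, so the conversion, the `sharpness` parameter, and all function names are illustrative assumptions.

```python
import math


def logistic_cdf(x, sharpness):
    """Logistic CDF; a larger sharpness concentrates weight near the surface."""
    z = max(min(sharpness * x, 60.0), -60.0)  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def render_weights(sdf, sharpness=20.0):
    """Per-interval rendering weights from SDF samples ordered near to far.

    Assumed NeuS-style opacity: alpha_i = max((Phi(f_i) - Phi(f_{i+1})) / Phi(f_i), 0).
    """
    weights = []
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    for i in range(len(sdf) - 1):
        cdf_here = logistic_cdf(sdf[i], sharpness)
        cdf_next = logistic_cdf(sdf[i + 1], sharpness)
        alpha = max((cdf_here - cdf_next) / (cdf_here + 1e-8), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


def render_depth(weights, depths):
    """Expected depth along the ray: weighted sum of sample depths."""
    return sum(w * t for w, t in zip(weights, depths))


depths = [0.1 * i for i in range(21)]   # samples from 0 to 2 along the ray
sdf = [1.0 - t for t in depths]         # a flat surface at depth 1
weights = render_weights(sdf)
depth = render_depth(weights, depths)   # close to 1, up to the sample spacing
```

Color and surface normals would be rendered the same way, by replacing the sample depths with per-sample colors or normalized SDF gradients.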

Method

The paper uses OmniData [5] to extract geometric priors (depth and surface normals) from the given RGB images. It then uses COLMAP [23] to obtain sparse 3D points and camera poses. To integrate this information, the method constructs multi-level feature volume grids and MLP decoders for SDF and color. The features of query points along camera rays are sampled from the grids using trilinear interpolation and then decoded by the MLP decoders before rendering. All of these processes are differentiable, allowing optimization of the feature grids and decoders using loss functions. An overview of the method is provided in Figure 3.
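The grid lookup described above can be sketched as follows. This is a plain-Python illustration of trilinear interpolation over dense grids, with features from several resolutions concatenated; the nested-list grid layout, the helper names, and the concatenation step are assumptions made for the example, not details confirmed by the paper.

```python
import math


def trilinear(grid, p):
    """Interpolate a feature vector at point p = (x, y, z) given in voxel units.

    grid[i][j][k] is a list of floats; each axis needs at least 2 voxels.
    """
    sizes = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip(p, sizes):
        c = min(max(coord, 0.0), n - 1.0)      # clamp into the grid
        b = min(int(math.floor(c)), n - 2)     # lower corner index
        base.append(b)
        frac.append(c - b)
    dim = len(grid[0][0][0])
    out = [0.0] * dim
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = frac[0] if dx else 1.0 - frac[0]
                wy = frac[1] if dy else 1.0 - frac[1]
                wz = frac[2] if dz else 1.0 - frac[2]
                corner = grid[base[0] + dx][base[1] + dy][base[2] + dz]
                for c in range(dim):
                    out[c] += wx * wy * wz * corner[c]
    return out


def multi_level_feature(grids, p_unit):
    """Concatenate interpolated features from grids of different resolutions.

    p_unit is a point in the unit cube [0, 1]^3.
    """
    feats = []
    for grid in grids:
        sizes = (len(grid), len(grid[0]), len(grid[0][0]))
        p = [u * (n - 1) for u, n in zip(p_unit, sizes)]
        feats.extend(trilinear(grid, p))
    return feats


def make_grid(n, fn):
    """Build an n x n x n demo grid whose feature at (i, j, k) is fn(i, j, k)."""
    return [[[fn(i, j, k) for k in range(n)] for j in range(n)] for i in range(n)]
```

Because the interpolation weights are smooth functions of the query point, gradients can flow both to the grid features and to the point itself, which is what makes the SDF gradient available for normals and regularization.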

Figure 4: Qualitative results in a real-world scene. It shows that the Eikonal loss has difficulties in reconstructing surface geometry from a few training images, which results in over-smooth depth and color rendering. Top: images; bottom: depth maps.

The paper discusses a method for surface regularization in few-shot neural radiance field (NeRF) training. It addresses the issue that the commonly used Eikonal loss is ineffective when only sparse input views are available, leading to unstable optimization and low-fidelity results.

To remedy this, the paper proposes the Annealing Signed Distance Function (ASDF) loss, which consists of two components:

Geometric smoothing loss: This enforces the estimated signed distance function (SDF) values to match the distance between query points and the rendered surface. The smoothing is applied adaptively, decreasing over the training process to allow the network to first optimize the coarse structure and then refine the details.
Weighted Eikonal loss: This regularizes the norm of the SDF gradient to be close to 1, as in the standard Eikonal loss.
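The two components above can be sketched in plain Python for a single ray. This is only an illustration under stated assumptions: the target SDF is taken as the distance along the ray to the rendered depth, the smoothing acts inside a band around the rendered surface that shrinks linearly during training, and the Eikonal weights are supplied by the caller. The schedule, the band, the weighting, and every name below are hypothetical; the summary does not give the paper's exact formulation.

```python
def annealed_band(step, total_steps, start=1.0, end=0.05):
    """Hypothetical linear schedule for the smoothing band (scene units)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def geometric_smoothing_loss(sdf, depths, rendered_depth, band):
    """Pull SDF samples toward their along-ray distance to the rendered surface.

    Only samples closer than `band` to the rendered depth contribute.
    """
    errors = [
        (f - (rendered_depth - t)) ** 2
        for f, t in zip(sdf, depths)
        if abs(rendered_depth - t) < band
    ]
    return sum(errors) / len(errors) if errors else 0.0


def weighted_eikonal_loss(grad_norms, weights=None):
    """Weighted mean of (|grad f| - 1)^2; uniform weights if none are given."""
    if weights is None:
        weights = [1.0] * len(grad_norms)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * (g - 1.0) ** 2 for w, g in zip(weights, grad_norms)) / total


def asdf_loss(sdf, grad_norms, depths, rendered_depth, step, total_steps,
              lambda_weik=0.1):
    """Assumed combination: smoothing term plus a weighted Eikonal term."""
    band = annealed_band(step, total_steps)
    l_gs = geometric_smoothing_loss(sdf, depths, rendered_depth, band)
    l_weik = weighted_eikonal_loss(grad_norms)
    return l_gs + lambda_weik * l_weik
```

Early in training the wide band constrains many samples per ray, which gives the coarse-to-fine behavior described above; as the band narrows, only samples near the surface are constrained and finer detail can emerge.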

The paper also describes the overall network architecture, which uses multi-level feature grids and MLPs to predict the SDF and color at query points. The total loss function combines the ASDF loss with rendering losses for color and normal predictions, as well as a depth loss.
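As a reading aid, the combination of losses can be written as a weighted sum. The weight symbols follow the names listed in the implementation details ($\lambda_N$, $\lambda_D$, $\lambda_{\text{ASDF}}$, $\lambda_{\text{wEik}}$); the unit weight on the color loss and the exact place where $\lambda_{\text{wEik}}$ enters are assumptions, since the summary does not spell out the formula.

```latex
\begin{align}
L &= L_C + \lambda_N L_N + \lambda_D L_D + \lambda_{\text{ASDF}} L_{\text{ASDF}}, \\
L_{\text{ASDF}} &= L_{\text{GS}} + \lambda_{\text{wEik}} L_{\text{wEik}}.
\end{align}
```

Here $L_C$, $L_N$, and $L_D$ are the color, normal, and depth rendering losses, $L_{\text{GS}}$ is the geometric smoothing loss, and $L_{\text{wEik}}$ is the weighted Eikonal loss.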

The proposed method aims to enable stable and high-fidelity few-shot NeRF training by providing reliable surface supervision through the ASDF loss.

Experiments

Figure 6: Analysis on geometric regularization loss. For both the Eikonal loss (blue) and our loss function (orange), we plotted the mean (a) and standard deviation (b) for the gradient at the nearby surface (< 10 cm) and far from the surface (90 cm to 1 m) against iterations with ideal value (red). In addition, we provided visualizations of the rendered RGB and depth maps for both cases in (c). The left images show the rendered results supervised only by the Eikonal loss for the SDF loss, while the middle images are supervised by our annealing SDF loss. The last column shows the ground truth images. In (c), top row: color images; bottom row: depth maps.

The paper discusses the implementation details and experimental results of the proposed neural radiance field (NeRF) system.

Implementation Details:

The system was implemented using a single TITAN RTX 24GB GPU, with adjustments to the voxel sizes for each scene to fit the GPU memory and improve performance.
The system utilized a multi-resolution feature grid with i=4 for both SDF and color feature grids.
The decoders were implemented using MLPs with a single layer containing 128 channels.
The learning rates for the decoder and feature grids were set to 0.001 and 0.01, respectively.
For the surface reconstruction parameter, an initial standard deviation of 0.3 and a learning rate of 0.001 were used.
The Adam optimizer was used to optimize the system.
The paper provides the values used for various loss function components, such as λN, λD, λASDF, λwEik, and iopt.
A customized grid sampler was employed due to the lack of differentiable grid sampling in PyTorch.
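The settings listed above can be collected in a small configuration object. The sketch below uses only the Python standard library; the field names are hypothetical, reading "i=4" as the number of grid levels is an assumption, and the loss weights are left unset because this summary does not give their values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the implementation details (names are hypothetical)."""
    grid_levels: int = 4          # "i=4", assumed to mean four resolution levels
    decoder_layers: int = 1       # single-layer MLP decoders
    decoder_channels: int = 128
    lr_decoder: float = 1e-3
    lr_grid: float = 1e-2
    init_std: float = 0.3         # initial value of the surface parameter
    lr_std: float = 1e-3
    optimizer: str = "adam"
    # Loss weights are not stated in this summary, so they stay unset.
    lambda_n: Optional[float] = None
    lambda_d: Optional[float] = None
    lambda_asdf: Optional[float] = None
    lambda_weik: Optional[float] = None


def param_groups(cfg, decoder_params, grid_params, std_params):
    """Per-group learning rates as a list of dicts with "params" and "lr" keys."""
    return [
        {"params": decoder_params, "lr": cfg.lr_decoder},
        {"params": grid_params, "lr": cfg.lr_grid},
        {"params": std_params, "lr": cfg.lr_std},
    ]


cfg = TrainConfig()
groups = param_groups(cfg, decoder_params=[], grid_params=[], std_params=[])
```

A list of dicts in this shape is what `torch.optim.Adam` accepts when different parameter sets need different learning rates, so real parameter lists could be passed in place of the empty placeholders.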

Dataset and Few-Shot Training Scheme:

The system was trained and evaluated on the ScanNet and NeRF-Real datasets, following the protocols used in previous works.
The ScanNet dataset was used with 18-20 training images per scene and 8 images for validation and testing.
The NeRF-Real dataset was used with 2, 5, and 10 views per scene for training.

Impact of ASDF Loss:

The use of the ASDF loss helped stabilize the optimization process and improved the quality of the rendered color and depth images, especially in few-shot learning scenarios.

Comparison with State-of-the-Art:

The proposed system was compared with NeRF, DSNeRF, DDPNeRF, and SCADE on the ScanNet and NeRF-Real datasets.
The quantitative results showed that the

Conclusion

The paper presents a fast few-shot NeRF (Neural Radiance Fields) method that combines deep dense priors and structure from motion. To address the challenge of optimizing geometric information from sparse input views, the method introduces a new surface regularization loss called the Annealing Signed Distance Function (ASDF) loss. This loss enforces geometric smoothing and improves the performance of novel view synthesis with sparse input views.

The technique successfully connects deep dense priors, multi-view consistency, and multi-resolution voxel grids for novel view synthesis. It can be further enhanced by adapting recent approaches to improve optimization speed. Handling uncertainty in geometric priors may also boost performance by reducing errors.

A limitation of the method is that the ASDF loss requires hyperparameters that depend on the scene geometry or structure from motion results, such as the accuracy of camera poses. Solving this issue in an adaptive manner without heuristic tuning is identified as a potential future direction.

The work was supported by the National Research Foundation of Korea.

Related Papers

Spatial Annealing Smoothing for Efficient Few-shot Neural Rendering

Yuru Xiao, Xianming Liu, Deming Zhai, Kui Jiang, Junjun Jiang, Xiangyang Ji

Neural Radiance Fields (NeRF) with hybrid representations have shown impressive capabilities in reconstructing scenes for view synthesis, delivering high efficiency. Nonetheless, their performance significantly drops with sparse view inputs, due to the issue of overfitting. While various regularization strategies have been devised to address these challenges, they often depend on inefficient assumptions or are not compatible with hybrid models. There is a clear need for a method that maintains efficiency and improves resilience to sparse views within a hybrid framework. In this paper, we introduce an accurate and efficient few-shot neural rendering method named Spatial Annealing smoothing regularized NeRF (SANeRF), which is specifically designed for a pre-filtering-driven hybrid representation architecture. We implement an exponential reduction of the sample space size from an initially large value. This methodology is crucial for stabilizing the early stages of the training phase and significantly contributes to the enhancement of the subsequent process of detail refinement. Our extensive experiments reveal that, by adding merely one line of code, SANeRF delivers superior rendering quality and much faster reconstruction speed compared to current few-shot NeRF methods. Notably, SANeRF outperforms FreeNeRF by 0.3 dB in PSNR on the Blender dataset, while achieving 700x faster reconstruction speed.

6/13/2024

cs.CV

SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance

Yuru Xiao, Xianming Liu, Deming Zhai, Kui Jiang, Junjun Jiang, Xiangyang Ji

Neural Radiance Field (NeRF) technology has made significant strides in creating novel viewpoints. However, its effectiveness is hampered when working with sparsely available views, often leading to performance dips due to overfitting. FreeNeRF attempts to overcome this limitation by integrating implicit geometry regularization, which incrementally improves both geometry and textures. Nonetheless, an initial low positional encoding bandwidth results in the exclusion of high-frequency elements. The quest for a holistic approach that simultaneously addresses overfitting and the preservation of high-frequency details remains ongoing. This study introduces a novel feature matching based sparse geometry regularization module. This module excels in pinpointing high-frequency keypoints, thereby safeguarding the integrity of fine details. Through progressive refinement of geometry and textures across NeRF iterations, we unveil an effective few-shot neural rendering architecture, designated as SGCNeRF, for enhanced novel view synthesis. Our experiments demonstrate that SGCNeRF not only achieves superior geometry-consistent outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB and 0.6 dB in PSNR on the LLFF and DTU datasets, respectively.

6/18/2024

cs.CV

RaNeuS: Ray-adaptive Neural Surface Reconstruction

Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari

Our objective is to leverage a differentiable radiance field, e.g., NeRF, to reconstruct detailed 3D surfaces in addition to producing the standard novel view renderings. There have been related methods that perform such tasks, usually by utilizing a signed distance field (SDF). However, the state-of-the-art approaches still fail to correctly reconstruct small-scale details, such as leaves, ropes, and textile surfaces. Considering that different methods formulate and optimize the projection from SDF to radiance field with a globally constant Eikonal regularization, we improve with a ray-wise weighting factor to prioritize the rendering and zero-crossing surface fitting on top of establishing a perfect SDF. We propose to adaptively adjust the regularization on the signed distance field so that unsatisfying rendering rays won't enforce strong Eikonal regularization, which is ineffective, and to allow the gradients from regions with well-learned radiance to be effectively back-propagated to the SDF, consequently balancing the two objectives in order to generate accurate and detailed surfaces. Additionally, concerning whether there is a geometric bias between the zero-crossing surface in SDF and rendering points in the radiance field, the projection becomes adjustable as well depending on different 3D locations during optimization. Our proposed RaNeuS is extensively evaluated on both synthetic and real datasets, achieving state-of-the-art results on both novel view synthesis and geometric reconstruction.

6/17/2024

cs.CV

Depth Supervised Neural Surface Reconstruction from Airborne Imagery

Vincent Hackstein, Paul Fauth-Mayer, Matthias Rothermel, Norbert Haala

While originally developed for novel view synthesis, Neural Radiance Fields (NeRFs) have recently emerged as an alternative to multi-view stereo (MVS). Triggered by a manifold of research activities, promising results have been gained especially for texture-less, transparent, and reflecting surfaces, while such scenarios remain challenging for traditional MVS-based approaches. However, most of these investigations focus on close-range scenarios, with studies for airborne scenarios still missing. For this task, NeRFs face potential difficulties at areas of low image redundancy and weak data evidence, as often found in street canyons, facades or building shadows. Furthermore, training such networks is computationally expensive. Thus, the aim of our work is twofold: First, we investigate the applicability of NeRFs for aerial image blocks representing different characteristics like nadir-only, oblique and high-resolution imagery. Second, during these investigations we demonstrate the benefit of integrating depth priors from tie-point measures, which are provided during presupposed Bundle Block Adjustment. Our work is based on the state-of-the-art framework VolSDF, which models 3D scenes by signed distance functions (SDFs), since this is more applicable for surface reconstruction compared to the standard volumetric representation in vanilla NeRFs. For evaluation, the NeRF-based reconstructions are compared to results of a publicly available benchmark dataset for airborne images.

4/26/2024

cs.CV